Why do HDF5 and cPickle files take much more disk space than the same images stored as individual files?

I am trying to save a large number of images in a format that uses as little disk space as possible. I tested HDF5 and cPickle in Python. Surprisingly, the data files generated by PyTables and cPickle are much larger than the folder containing the same number of image files.

My code is here:

# Python 2 code (cPickle, xrange).
import cv2
import copy
import cPickle as pickle
import tables
import numpy as np

# Decode the JPEG into a raw pixel array and stack 1000 copies of it.
image = cv2.imread("aloel.jpg")
images = []
for i in xrange(1000):
    images.append(copy.deepcopy(image))
images = np.asarray(images, dtype=np.uint8)

# Write the stack to HDF5, requesting Blosc compression at level 5.
hdf5_path = "img.hdf5"
filters = tables.Filters(complevel=5, complib='blosc')
with tables.open_file(hdf5_path, mode='w', filters=filters) as hdf5_file:
    data_storage = hdf5_file.create_array(hdf5_file.root, 'data', obj=images)

# Write the same stack with pickle.
with open('img.pickle', 'wb') as f:
    pickle.dump(images, f, protocol=pickle.HIGHEST_PROTOCOL)

The folder containing 1000 copies of aloel.jpg takes 61.5 MB, but img.hdf5 and img.pickle are both 1.3 GB.

Why does this happen? Does it mean it is better to save image data as individual image files rather than in a pickle or HDF5 file?

Update: your problem is that compression is not applied at all. Compression requires chunked storage, which you get by replacing "create_array" with "create_carray". Then apply "zlib" with complevel 5 and you should already see some improvement. In this particular case it also makes sense to chunk along the repeated-data axis, so if you add something like chunkshape=[100,100,100,3] to the create_carray call, you should see a major change.
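A minimal sketch of that change, assuming Python 3 with PyTables 3.x and NumPy installed. The synthetic 64x64 random image (a stand-in for the decoded JPEG), the helper names make_images, write_plain and write_chunked, and the clamping of the chunk shape to the array shape are illustrative assumptions, not part of the original code:

```python
import os
import tempfile

import numpy as np
import tables


def make_images(n=200, height=64, width=64, seed=0):
    """Stack n copies of one random uint8 image (stand-in for the JPEG)."""
    rng = np.random.default_rng(seed)
    image = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
    return np.repeat(image[np.newaxis], n, axis=0)


def write_plain(images, path):
    """Store with create_array, as in the question (contiguous, not chunked)."""
    filters = tables.Filters(complevel=5, complib="blosc")
    with tables.open_file(path, mode="w", filters=filters) as h5:
        h5.create_array(h5.root, "data", obj=images)
    return os.path.getsize(path)


def write_chunked(images, path, chunkshape=(100, 100, 100, 3)):
    """Store with create_carray, zlib level 5, chunked along the image axis."""
    # Assumption: keep each chunk dimension within the array's own shape.
    chunkshape = tuple(min(c, s) for c, s in zip(chunkshape, images.shape))
    filters = tables.Filters(complevel=5, complib="zlib")
    with tables.open_file(path, mode="w") as h5:
        h5.create_carray(h5.root, "data", obj=images,
                         filters=filters, chunkshape=chunkshape)
    return os.path.getsize(path)


def demo():
    images = make_images()
    with tempfile.TemporaryDirectory() as tmp:
        plain = write_plain(images, os.path.join(tmp, "plain.h5"))
        chunked = write_chunked(images, os.path.join(tmp, "chunked.h5"))
    print("raw array bytes:     ", images.nbytes)
    print("create_array bytes:  ", plain)
    print("create_carray bytes: ", chunked)


if __name__ == "__main__":
    demo()
```

The chunk has to span several images for the compressor to see the repetition at all. One plausible reason the suggested 100x100x3 slice works well with zlib is that each image's slice within a chunk is then 30,000 bytes, which fits inside the 32 KiB window that deflate can look back over.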

JPEG is a highly efficient lossy compression format, and loading a JPEG with cv2.imread decodes it into raw pixels, so the array you store is the uncompressed data. Blosc is optimised for speed, and pickle is not compressed at all by default. Other HDF5 filters are available; take a look at https://support.hdfgroup.org/services/filters.html and I believe you can find a method that gets close enough to the original JPEG sizes.
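Going by the sizes in the question, each decoded image takes about 1.3 MB against roughly 61.5 KB per JPEG file, about a 21x difference. The sketch below illustrates the pickle half of that point, assuming Python 3 (pickle rather than cPickle) and NumPy. It compares a plain pickle with the same pickle passed through the standard-library gzip module. The random 64x64 image and the helper names are hypothetical, and gzip is one possible choice of compressor, not something the original post prescribes:

```python
import gzip
import pickle

import numpy as np


def make_stack(n=200, height=64, width=64, seed=0):
    """Stack n copies of one random uint8 image (stand-in for decoded pixels)."""
    rng = np.random.default_rng(seed)
    image = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
    return np.repeat(image[np.newaxis], n, axis=0)


def pickled_size(obj):
    """Size in bytes of a plain pickle, which stores the raw pixel buffer."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def gzip_pickled_size(obj, level=6):
    """Size in bytes of the same pickle after gzip compression."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(gzip.compress(payload, compresslevel=level))


def dump_gzip(obj, path):
    """Write a gzip-compressed pickle to disk."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_gzip(path):
    """Read back a gzip-compressed pickle."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    stack = make_stack()
    print("raw array bytes:   ", stack.nbytes)
    print("plain pickle bytes:", pickled_size(stack))
    print("gzip pickle bytes: ", gzip_pickled_size(stack))
```

The gain here comes almost entirely from the 200 identical copies. A stack of distinct photographs would compress far less with a lossless codec, which is why it is hard to match JPEG file sizes this way.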


 