How to configure maxshape argument for H5 and append to file?

Question

I'm trying to combine an image dataset into an H5 file. So far I have managed to create the file, but when I append to it, it just overwrites what's already there. I've looked at other answers (e.g. "Adding data to existing h5py file along new axis using h5py") and tried variations of them, but to no avail.

for i in range(len(files)):
    if i == 0:
        with h5py.File('input_images.h5', 'w') as f:
            img = np.array(Image.open(files[i]))
            f.create_dataset('/array', data = img, maxshape = (None), chunks = True, dtype = img.dtype)
    else:
        with h5py.File('input_images.h5', 'r+') as f:
            img = np.array(Image.open(files[i]))
            f.require_dataset('/array', data = img, shape = img.shape, dtype = img.dtype)
    print(i)

I've tried setting maxshape to (None, None, None), but that just raises an error: ValueError: "maxshape" must have same rank as dataset shape

There are 1000 images in total, each of shape 2048 by 2048. Can someone show me how to fix my code?

Answer 1

Using the maxshape parameter allows you to resize the dataset later. Note that maxshape must have the same number of dimensions as the dataset. You passed maxshape=(None), which Python evaluates to plain None rather than a 1-tuple, and in any case you need 3 dimensions to hold all of the image data: (1000, 2048, 2048). Also, the initial dataset shape in your code is taken from the data=img array, so it is (2048, 2048). The dataset needs a third dimension to index the images.
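To make the rank rule concrete, here is a minimal sketch (not the asker's real data). The temp-file path, the 64 by 64 image size, and the dataset names "bad" and "array" are assumptions chosen for illustration. It first reproduces the rank mismatch from the question, then creates a 3-D dataset whose first axis is unlimited and grows it by one image.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file path in a temp directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "maxshape_demo.h5")

img = np.zeros((64, 64), dtype=np.uint8)  # stand-in for one 2-D image

with h5py.File(path, "w") as f:
    # Rank mismatch: the data is 2-D but maxshape has 3 entries, so h5py refuses.
    try:
        f.create_dataset("bad", data=img, maxshape=(None, None, None))
        rank_error = None
    except ValueError as err:
        rank_error = str(err)

    # Rank match: add a leading image axis and leave only that axis unlimited.
    ds = f.create_dataset(
        "array",
        shape=(1, 64, 64),
        maxshape=(None, 64, 64),
        dtype=img.dtype,
        chunks=True,
    )
    ds[0] = img
    ds.resize(2, axis=0)  # grow along the unlimited first axis
    ds[1] = img
    final_shape = ds.shape

print(rank_error)
print(final_shape)
```

Only the first axis is declared unlimited here, since every image is assumed to have the same height and width.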
There are 3 approaches to loading all of your image data:
1. Set shape=(nfiles,a1,a2) so the dataset is initially sized for all images. There is no need to resize unless you want to add more images later.
2. Initially set shape=(1,a1,a2) (for 1 image), then use .resize() to grow the dataset by one each time you add an image. This method becomes inefficient as the dataset grows.
3. Initially set shape=(N,a1,a2) (for N images), then use .resize() to grow the dataset by N whenever it is full. (N can be any number. I used 10 in the example below, but you might use 100 or 1000 for a real-world application.)

All 3 methods are in the example below, using 30 images with a smaller image size. I create random integer data for the images. Replace np.random.randint() with np.array(Image.open(files[i])) for your files.

The examples demonstrate the process. Note that Methods 1 and 2 only work when you create the HDF5 file and populate the image data in the same run (because the dataset index is the same as the image counter). Method 3 shows how to add data incrementally. It uses an attribute that counts the number of images loaded. The counter sets the position at which to add the new image. It is also used to check the current dataset size (and resize as needed).

In production code you need additional checks that each image's size and shape match the dataset's.
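One way such a check might look is sketched below. The helper name check_image_fits is hypothetical, and the dtype test (NumPy's "same_kind" casting rule) is an assumption about what "matching" should mean. It takes the dataset's shape and dtype rather than the dataset itself, so it can be tried without an HDF5 file.

```python
import numpy as np


def check_image_fits(img, ds_shape, ds_dtype):
    """Raise ValueError if img cannot be stored as one row of an (N, H, W) dataset.

    Hypothetical helper: pass img_ds.shape and img_ds.dtype from an h5py dataset.
    """
    row_shape = tuple(ds_shape[1:])
    if img.shape != row_shape:
        raise ValueError(
            f"image shape {img.shape} does not match dataset row shape {row_shape}"
        )
    # Assumption: allow only casts NumPy considers "same kind" (e.g. uint8 -> float32).
    if not np.can_cast(img.dtype, ds_dtype, casting="same_kind"):
        raise ValueError(
            f"image dtype {img.dtype} cannot be safely stored as {ds_dtype}"
        )


# Passes silently: shape matches and uint8 can be stored as float32.
check_image_fits(np.zeros((4, 4), dtype=np.uint8), (10, 4, 4), np.dtype("float32"))
```

In the loops below, it could be called as check_image_fits(img_arr, img_ds.shape, img_ds.dtype) just before each write.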

import h5py
import numpy as np

nfiles = 30
a0 = nfiles          # number of images
a1 = 256; a2 = 256   # image height and width

# Method 1: size the dataset for all images up front
with h5py.File('input_images1.h5', 'w') as f:
    for i in range(nfiles):
        img_arr = np.random.randint(0, 254, (a1, a2), int)
        if i == 0:
            img_ds = f.create_dataset('/array', shape=(a0, a1, a2),
                                      maxshape=(None, a1, a2), chunks=True)
        img_ds[i, :, :] = img_arr   # image i goes in row i
        print(i)

# Method 2: start with room for 1 image and grow by 1 for each new image
with h5py.File('input_images2.h5', 'w') as f:
    for i in range(nfiles):
        img_arr = np.random.randint(0, 254, (a1, a2), int)
        if i == 0:
            img_ds = f.create_dataset('/array', shape=(1, a1, a2),
                                      maxshape=(None, a1, a2), chunks=True)
        else:
            img_ds.resize(i + 1, axis=0)   # make room for image i
        img_ds[i, :, :] = img_arr
        print(i)

# Method 3: grow in blocks of 10 and track the image count in an attribute
with h5py.File('input_images3.h5', 'a') as f:
    for i in range(nfiles):
        img_arr = np.random.randint(0, 254, (a1, a2), int)
        if 'array' not in f:
            img_ds = f.create_dataset('/array', shape=(10, a1, a2),
                                      maxshape=(None, a1, a2), chunks=True)
            img_ds.attrs['n_images'] = 0
        else:
            img_ds = f['/array']

        # n_images is both the count so far and the next free row
        n_images = img_ds.attrs['n_images']
        if n_images == img_ds.shape[0]:
            print('adding 10 rows to /array')
            img_ds.resize(img_ds.shape[0] + 10, axis=0)

        img_ds[n_images, :, :] = img_arr
        img_ds.attrs['n_images'] = n_images + 1
        print(img_ds.attrs['n_images'])
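
Because Method 3 grows in blocks, the dataset can end up with unused rows at the end. The sketch below follows the same pattern and then trims the dataset back to the counted number of images with .resize(). The temp-file path, the 8 by 8 image size, the uint8 dtype, and the count of 13 images are all assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file path in a temp directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "trim_demo.h5")
a1, a2 = 8, 8   # small image size for the demo
block = 10      # rows added per resize, as in Method 3

with h5py.File(path, "w") as f:
    ds = f.create_dataset("array", shape=(block, a1, a2),
                          maxshape=(None, a1, a2), chunks=True, dtype="uint8")
    ds.attrs["n_images"] = 0

    for _ in range(13):
        n = int(ds.attrs["n_images"])
        if n == ds.shape[0]:
            ds.resize(ds.shape[0] + block, axis=0)
        ds[n] = np.full((a1, a2), n, dtype=np.uint8)  # fill image n with value n
        ds.attrs["n_images"] = n + 1

    shape_before = ds.shape
    # Drop the unused rows so the first axis equals the number of images stored.
    ds.resize(int(ds.attrs["n_images"]), axis=0)
    shape_after = ds.shape

with h5py.File(path, "r") as f:
    stored = f["array"][:]

print(shape_before, shape_after)
```

After trimming, the dataset can still be grown again later, since maxshape keeps the first axis unlimited.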

solution1
1 ACCEPTED 2020-05-14 13:46:18