
How do I aggregate 50 datasets within an HDF5 file?

I have an HDF5 file with 2 groups, each containing 50 datasets of 4D NumPy arrays of the same type. I want to combine the 50 datasets in each group into a single dataset; in other words, instead of 2 x 50 datasets I want 2 x 1 datasets. How can I accomplish this? The file is 18.4 GB in size. I am a novice at working with large datasets, and I am working in Python with h5py.

Thanks!

Look at this answer: How can I combine multiple .h5 files? - Method 3b: Merge all data into 1 Resizable Dataset. It describes a way to copy data from multiple HDF5 files into a single dataset. You want to do something similar; the only difference is that all the datasets you want to copy are in one HDF5 file.
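The key h5py feature Method 3b relies on is a resizable dataset: create it with a maxshape that allows growth along one axis, then call resize() to extend it before writing new data. A minimal sketch (the file name resizable_demo.h5 is arbitrary, just for this demo):

```python
import h5py
import numpy as np

with h5py.File('resizable_demo.h5', 'w') as h5f:
    # maxshape=(None,) allows unlimited growth along axis 0
    dset = h5f.create_dataset('growing', data=np.arange(5), maxshape=(None,))
    dset.resize(10, axis=0)        # extend from 5 to 10 elements
    dset[5:10] = np.arange(5, 10)  # write into the newly added region
    print(dset.shape)  # (10,)
```

The full example below does the same thing, only in 4 dimensions with growth along axis=3.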

I wrote a self-contained example to demonstrate the procedure. First it creates some data and closes the file. Then it reopens the file (read only) and creates a new file for the copied datasets. It loops over the groups and datasets in the first file and copies the data into a merged dataset in the second file. You didn't say how you want to stack the 4D arrays, so I stacked them along axis=3; you can modify the slice notation as desired. Also, this is a simple example that will work for your specific case. If you are writing a general solution, it should check for compatible shapes and dtypes (which I don't do).

Example code below:

import h5py
import numpy as np

# Create a simple H5 file with 2 groups and 5 datasets (shape=a0,a1,a2,a3)
with h5py.File('SO_69937402_2x5.h5','w') as h5f1:
    
    a0,a1,a2,a3 = 100,20,20,10
    grp1 = h5f1.create_group('group1')
    for ds in range(1,6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0,a1,a2,a3)
        grp1.create_dataset(f'dset_{ds:02d}',data=arr)

    grp2 = h5f1.create_group('group2')
    for ds in range(1,6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0,a1,a2,a3)
        grp2.create_dataset(f'dset_{ds:02d}',data=arr)        
    
with h5py.File('SO_69937402_2x5.h5','r') as h5f1, \
     h5py.File('SO_69937402_2x1.h5','w') as h5f2:
          
    # loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:',grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        for ds in h5f1[grp].keys():
            print('working on dataset:',ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # If the merged dataset doesn't exist in this group, create it.
                # Set maxshape so the dataset is resizable along axis 3.
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds',data=h5f1[grp][ds],
                                     maxshape=[ds_shape[0],ds_shape[1],ds_shape[2],None])
            else:
                # otherwise, resize the merged dataset to hold new values
                ds1_shape = h5f1[grp][ds].shape
                ds2_shape = merge_ds.shape
                merge_ds.resize(ds1_shape[3]+ds2_shape[3],axis=3)
                merge_ds[ :,:,:, ds2_shape[3]:ds2_shape[3]+ds1_shape[3] ] = h5f1[grp][ds]
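The shape/dtype compatibility check mentioned above (which the example omits) could look something like this. The function name check_merge_compatibility is hypothetical, just for illustration; the demo uses an in-memory HDF5 file (driver='core', backing_store=False) so nothing is written to disk:

```python
import h5py
import numpy as np

def check_merge_compatibility(group, axis=3):
    """Verify every dataset in `group` can be stacked along `axis`:
    same dtype and identical shape on all other axes.
    Returns the common dtype and base shape, or raises ValueError."""
    names = sorted(group.keys())
    ref = group[names[0]]
    base = tuple(s for i, s in enumerate(ref.shape) if i != axis)
    for name in names[1:]:
        ds = group[name]
        if ds.dtype != ref.dtype:
            raise ValueError(f'{name}: dtype {ds.dtype} != {ref.dtype}')
        if tuple(s for i, s in enumerate(ds.shape) if i != axis) != base:
            raise ValueError(f'{name}: incompatible shape {ds.shape}')
    return ref.dtype, base

# Quick demonstration on a small in-memory file
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as h5f:
    g = h5f.create_group('group1')
    for i in range(3):
        g.create_dataset(f'dset_{i:02d}', data=np.zeros((4, 5, 6, 2)))
    dtype, base = check_merge_compatibility(g)
    print(dtype, base)  # float64 (4, 5, 6)
```

Run this check on each group before the copy loop; it fails fast instead of corrupting a partially merged file.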
