Vacuum HDF5 dataset (to remove rows of data and resize)

Question

Let's say I have an HDF5 dataset with maxshape=(None,1000) and chunks=(1,1000).

Whenever I need to delete a row, I just zero it out (this happens for many rows):

  ds[ix,:] = 0

What is the fastest way to vacuum the zeroed rows and resize the array?

Now let's add a twist. I have a dict that resolves symbols to dataset row indices:

{ name : ds_ix }

What is the fastest way to vacuum and keep each ds_ix correct?

Answer 1

Did you mean resize the dataset when you asked 'resize the array?' (Also, I assume you meant maxshape=(None,1000).) If so, use the .resize() method. However, if you aren't removing the last row(s), you will have to rearrange the non-zero data first, then resize. (And you really don't need to zero out the row(s), since you are going to overwrite them.)
I can think of 2 approaches to rearrange the data: 1) use slice notation to define FROM and TO indices, or 2) read the dataset into a numpy array, delete the rows, and copy it back. Both involve disk I/O, so it's not clear which is faster without testing. It probably doesn't matter for small datasets with only a few deleted rows. I suspect the second method will be better if you plan to delete a lot of rows from large datasets, but benchmark tests are required to confirm.
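One way to run such a benchmark is sketched below. It is only a rough harness, not a rigorous measurement: the helper names (delete_by_shift, delete_by_rewrite, time_delete), the file name bench.h5, and the dataset size are all made up for illustration, and a single timed run on a temporary file says little about your real data or storage.

```python
import os
import tempfile
import time

import h5py
import numpy as np


def delete_by_shift(ds, idx):
    """Approach 1: shift the rows after idx up by one, then shrink."""
    n = ds.shape[0]
    if idx < n - 1:  # nothing to shift when deleting the last row
        ds[idx:n - 1] = ds[idx + 1:n]
    ds.resize(n - 1, axis=0)


def delete_by_rewrite(ds, idx):
    """Approach 2: read everything, drop the row in memory, write back."""
    kept = np.delete(ds[()], idx, axis=0)
    ds.resize(kept.shape[0], axis=0)
    ds[:] = kept


def time_delete(delete_fn, rows=2000, cols=1000, idx=5):
    """Build a throwaway file, time one deletion, return (seconds, result)."""
    arr = np.arange(rows * cols, dtype="float32").reshape(rows, cols)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bench.h5")  # hypothetical file name
        with h5py.File(path, "w") as h5f:
            h5f.create_dataset("test", data=arr, maxshape=(None, cols))
        start = time.perf_counter()
        with h5py.File(path, "r+") as h5f:
            delete_fn(h5f["test"], idx)
        elapsed = time.perf_counter() - start
        with h5py.File(path, "r") as h5f:
            result = h5f["test"][()]
    return elapsed, result


t_shift, out_shift = time_delete(delete_by_shift)
t_rewrite, out_rewrite = time_delete(delete_by_rewrite)
print(f"shift: {t_shift:.4f}s  rewrite: {t_rewrite:.4f}s")
```

Both functions should leave identical data behind, so only the timings differ. Vary rows, idx, and the chunk shape to match your own case before drawing conclusions.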

Note: be careful when setting the chunk size. It controls the I/O size, and you will be doing a lot of I/O when you move rows. Setting it too small (or too large) can degrade performance. (1,1000) is probably too small: the recommended chunk size is 10 KiB to 1 MiB, and a (1,1000) float32 chunk is only 4,000 bytes (about 3.9 KiB).
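To check a candidate chunk shape against that range, you can compute its uncompressed size directly. The helper chunk_nbytes below is a hypothetical name, and (64,1000) is just an illustrative alternative, not a tuned recommendation.

```python
import numpy as np


def chunk_nbytes(chunk_shape, dtype):
    """Return the uncompressed size in bytes of one chunk."""
    return int(np.prod(chunk_shape)) * np.dtype(dtype).itemsize


small = chunk_nbytes((1, 1000), "float32")    # one row per chunk
larger = chunk_nbytes((64, 1000), "float32")  # 64 rows per chunk (illustrative)
print(small, larger)
```

A shape that lands in the recommended range can then be passed as chunks=(64,1000) when calling create_dataset().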

Here are both approaches with a very small dataset.

Create an HDF5 file:

import h5py
import numpy as np

with h5py.File('SO_73353006.h5','w') as h5f:
    a0, a1 = 10, 5
    arr = np.arange(a0*a1).reshape(a0,a1)
    # maxshape=(None,a1) makes axis 0 resizable
    ds = h5f.create_dataset('test',data=arr,maxshape=(None,a1))

Method 1: move data, then resize dataset

with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5
    ds = h5f['test']
    #ds[idx,:] = 0 # Not required since we will overwrite the row
    a0 = ds.shape[0]
    ds[idx:a0-1] = ds[idx+1:a0]
    ds.resize(a0-1,axis=0)

Method 2: extract array, delete row and copy data to resized dataset

with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5
    ds = h5f['test']
    a0 = ds.shape[0]
    a1 = ds.shape[1]
    # read dataset into array and delete row
    ds_arr = ds[()]
    ds_arr = np.delete(ds_arr, obj=idx, axis=0)  
    # Resize dataset and load array
    ds.resize(a0-1,axis=0)  # same as above
    ds[:] = ds_arr[:]
    # Create a new dataset for comparison
    ds2 = h5f.create_dataset('test2',data=ds_arr,maxshape=(None,a1))
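
For the twist in the question (removing many zeroed rows at once while keeping the { name : ds_ix } dict valid), the second method extends naturally. The sketch below is one possible way to do it, not a tested recipe for large files: vacuum is a hypothetical helper name, it assumes the whole dataset fits in memory, and it assumes a row of all zeros always means "deleted". It uses an in-memory file (driver='core') so nothing is written to disk.

```python
import h5py
import numpy as np


def vacuum(ds, name_to_ix):
    """Drop all-zero rows from ds in place and return the remapped dict.

    Assumes the dataset fits in memory and that an all-zero row
    always marks a deleted row.
    """
    data = ds[()]
    keep = np.any(data != 0, axis=1)   # True for rows that survive
    new_ix = np.cumsum(keep) - 1       # new position of each surviving row
    n_keep = int(keep.sum())
    if n_keep:
        ds[:n_keep] = data[keep]       # pack surviving rows at the front
    ds.resize(n_keep, axis=0)          # cut off the leftover tail
    # Names that pointed at deleted rows are dropped from the dict.
    return {name: int(new_ix[ix]) for name, ix in name_to_ix.items() if keep[ix]}


with h5py.File("vacuum_demo.h5", "w", driver="core", backing_store=False) as h5f:
    arr = np.arange(1, 31).reshape(6, 5)  # start at 1 so no live row is all zeros
    ds = h5f.create_dataset("test", data=arr, maxshape=(None, 5))
    symbols = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4, "f": 5}
    for ix in (1, 4):                     # "delete" rows b and e by zeroing them
        ds[ix, :] = 0
    symbols = vacuum(ds, symbols)
    new_shape = ds.shape
    remaining = ds[()]

print(new_shape, symbols)
```

The cumulative sum over the keep mask gives every surviving row its new index in one pass, so the dict can be rebuilt without searching. If you already track which indices were deleted, building the mask from that list avoids scanning the data for zeros.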
