
Writing data to a resized HDF5 dataset fails in surprising ways

I'm trying to create a dataset which I don't know the full size of initially.

I create my dataset with the following properties.

file['data'].create_dataset(
   name='test', shape=(10, len(arr1)), 
   maxshape=(10, None), dtype=float,
   scaleoffset=3, chunks=True, 
   compression='gzip', compression_opts=4, fillvalue=np.nan)

where the final dimension in shape is the one I need to expand (its initial size is given by the first input array, arr1).

When I resize the dataset for arr2, everything works fine, but when I try to extend it to the much larger size needed for arr3, things start to behave strangely.

If I incrementally resize and write each array one after the other, the contents of the dataset become corrupted: values beyond the length of the first array (arr1), in this case 100, are written as the fill value (nan), while the first 100 values are stored correctly. Note that this doesn't happen when resizing and writing arr2; that step correctly writes all values of arr2 while padding the first row with nan.

I've also tried manually increasing the chunk size, but then the correct fill value isn't used (it defaults to 0 rather than nan) when I write the smaller arrays, and unless the chunk size is explicitly larger than the largest array, the largest array is still truncated to the fill value beyond the chunk size.
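For illustration, "manually increasing the chunk size" means a call along the following lines, with an explicit chunk shape instead of chunks=True; the value (10, 4096) is only an example, not from my real code:

file['data'].create_dataset(
   name='test', shape=(10, len(arr1)), 
   maxshape=(10, None), dtype=float,
   scaleoffset=3, chunks=(10, 4096),  # explicit chunk shape (example value)
   compression='gzip', compression_opts=4, fillvalue=np.nan)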

import h5py
import numpy as np

arr1 = np.arange(0, 100, step=1, dtype=float)
arr2 = np.arange(0, 233, step=1, dtype=float)
arr3 = np.arange(0, 50000, step=1, dtype=float)

file = h5py.File(my_data_file, 'w')
file.create_group('data')
file['data'].create_dataset(
   name='test', shape=(10, len(arr1)), 
   maxshape=(10, None), dtype=float,
   scaleoffset=3, chunks=True, 
   compression='gzip', compression_opts=4, fillvalue=np.nan)

file['data']['test'][0, :len(arr1)] = arr1
try:
    file['data']['test'][1, :len(arr2)] = arr2
except TypeError as e:
    print('New data too large for old dataset, resizing')
    file['data']['test'].resize((10, len(arr2)))
    file['data']['test'][1, :len(arr2)] = arr2

If I stop here, everything looks as expected, but the main problem arises when I run the following code.

try:
    file['data']['test'][2, :len(arr3)] = arr3
except TypeError as e:
    print('New data too large for old dataset, resizing')
    file['data']['test'].resize((10, len(arr3)))
    file['data']['test'][2, :len(arr3)] = arr3

I ran some tests to diagnose this. First I ran 3 separate tests, and I see different behavior from what you described.
Test 1: arr1 only
Add only arr1 to row 0 and close the file:
Row 0 has the correct arr1 values, and rows 1-9 are filled with 0.0, not NaN.

Test 2: arr1 & arr2
Add arr1 to row 0, resize, add arr2 to row 1, and close the file:
In columns 0-99, rows 0 and 1 are filled, and rows 2-9 are filled with 0.0, not NaN. Columns 100+ are NaN for all rows. Note that the arr2 values beyond index 99 are not in the file.

Test 3: arr1, arr2, arr3
Load all 3 arrays following the process above:
Similar results to Test 2: in columns 0-99, rows 0, 1, and 2 are filled, and rows 3-9 are filled with 0.0, not NaN. Columns 100+ are NaN for all rows. Note that the arr2 and arr3 values beyond index 99 are not in the file.
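A minimal sketch of the kind of read-back check behind these observations (the path and the slices shown are only illustrative):

import h5py
import numpy as np

with h5py.File(my_data_file, 'r') as f:   # reopen the file after it was closed
    data = f['data']['test'][()]          # read the whole dataset into memory
    print(data.shape)                     # e.g. (10, 233) after Test 2
    print(data[0, 95:105])                # row 0 around column 100
    print(np.isnan(data).sum(axis=1))     # NaN count per row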

I then reran Test 3 after modifying create_dataset(), removing the following arguments: scaleoffset=3, chunks=True, compression='gzip', compression_opts=4. The resulting HDF5 file looks exactly as expected, with NaN everywhere data wasn't added (Row 0, columns 100+; Row 1, columns 233+; and all columns in rows 3-9). See the modified call below:

h5f['data'].create_dataset(
   name='test', shape=(10, len(arr1)), 
   maxshape=(10, None), dtype=float, fillvalue=np.nan) 

I don't know enough about the 4 deleted parameters to explain why this works -- only that it does.
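One way to narrow down which of the removed options is responsible would be to repeat Test 3 with each option enabled on its own; the sketch below is untested and the file names are placeholders:

import h5py
import numpy as np

arr1 = np.arange(0, 100, step=1, dtype=float)
arr3 = np.arange(0, 50000, step=1, dtype=float)

# One dataset per option, everything else left at the defaults
variants = {
    'scaleoffset': dict(scaleoffset=3),
    'chunks': dict(chunks=True),
    'gzip': dict(compression='gzip', compression_opts=4),
}

for label, opts in variants.items():
    with h5py.File('test_' + label + '.h5', 'w') as f:
        dset = f.create_group('data').create_dataset(
            name='test', shape=(10, len(arr1)),
            maxshape=(10, None), dtype=float,
            fillvalue=np.nan, **opts)
        dset[0, :len(arr1)] = arr1
        dset.resize((10, len(arr3)))
        dset[2, :len(arr3)] = arr3
        # Did row 2 survive the resize, and is the padding in row 0 NaN?
        print(label,
              bool((dset[2, :len(arr3)] == arr3).all()),
              bool(np.isnan(dset[0, len(arr1):]).all()))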
