
Why can I process a large file only when I don't fix the HDF5 deprecation warning?

After receiving the H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead. warning, I changed my code to:

import h5py
import numpy as np 

f = h5py.File('myfile.hdf5', mode='r')
foo = f['foo']
bar = f['bar']
N, C, H, W = foo.shape # (8192, 3, 1080, 1920)
data_foo = np.array(foo[()]) # [()] equivalent to .value

and when I tried to read a (not so) big file of images, I got Killed: 9 in my terminal: my process was killed on the last line of the code because it was consuming too much memory, despite that archaic comment of mine there.

However, my original code:

f = h5py.File('myfile.hdf5', mode='r')
data_foo = f.get('foo').value
# script's logic after that worked, process not killed

worked just fine, except for the issued warning.

Why did my code work?

Let me explain what your code is doing, and why you are getting memory errors. First, some HDF5/h5py basics. (The h5py docs are an excellent starting point. Check here: h5py QuickStart.)

foo = f['foo'] and foo = f.get('foo') both return an h5py dataset object named 'foo'. (Note: it's more common to see this as foo = f['foo'], but there's nothing wrong with the get() method.) A dataset object is not the same as a NumPy array. Datasets behave like NumPy arrays; both have a shape and a data type, and support array-style slicing. However, when you access a dataset object, you do not read all of the data into memory. As a result, datasets require less memory to access. This is important when working with large datasets!
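
You can see this lazy behavior directly (a minimal sketch, reusing the question's file and dataset names): accessing the dataset object and its metadata is cheap, and data is only read when you slice.

import h5py

with h5py.File('myfile.hdf5', mode='r') as h5f:
    ds = h5f['foo']            # h5py dataset object; no image data read yet
    print(type(ds))            # <class 'h5py._hl.dataset.Dataset'>
    print(ds.shape, ds.dtype)  # metadata only; cheap even for huge datasets
    first = ds[0]              # slicing reads just this one image into memory
    print(type(first))         # <class 'numpy.ndarray'>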

This statement returns a NumPy array: data_foo = f.get('foo').value. The preferred method is data_foo = f['foo'][:]. (NumPy slicing notation is the way to return a NumPy array from a dataset object. As you discovered, .value is deprecated.)
This also returns a NumPy array: data_foo = foo[()] (assuming foo is defined as above).
So, with the statement data_foo = np.array(foo[()]) you are creating a new NumPy array from another array (foo[()] is the input object). I suspect your process was killed because the amount of memory needed to create a copy of a (8192, 3, 1080, 1920) array exceeded your system resources. That statement will work for small datasets/arrays, but it's not good practice.
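
To put numbers on that suspicion (a rough sketch; the question doesn't show the dataset's dtype, so uint8 images are assumed here, and larger dtypes scale the totals up accordingly):

import numpy as np

shape = (8192, 3, 1080, 1920)
itemsize = np.dtype('uint8').itemsize                # assumed dtype: 1 byte per element
n_bytes = np.prod(shape, dtype=np.int64) * itemsize
print(f"one copy:   {n_bytes / 2**30:.1f} GiB")      # ~47.5 GiB
print(f"two copies: {2 * n_bytes / 2**30:.1f} GiB")  # ~94.9 GiB at peak

np.array(foo[()]) first materializes the array foo[()] and then copies it, so the peak is roughly two copies, while f['foo'].value (or f['foo'][()]) materializes only one. That may explain why the original code squeaked by while the rewritten code was killed.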

Here's an example to show how to use the different methods (h5py dataset object vs NumPy array).

h5f = h5py.File('myfile.hdf5', mode='r')

# This returns an h5py dataset object:
foo_ds = h5f['foo']
# You can slice it to get elements, like this:
foo_slice1 = foo_ds[0,:,:,:] # first image (first entry along axis 0)
foo_slice2 = foo_ds[-1,:,:,:] # last image

# This is the recommended method to get a NumPy array of the entire dataset:
foo_arr = h5f['foo'][:]
# or, referencing h5py dataset object above
foo_arr = foo_ds[:] 
# you can also create an array with a slice
foo_slice1 = h5f['foo'][0,:,:,:] 
# is the same as (from above):
foo_slice1 = foo_ds[0,:,:,:] 
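
And if the goal is to process all 8192 images without ever holding the full dataset in memory, you can iterate over the dataset object and read one image at a time (a sketch; process_image is a hypothetical stand-in for whatever per-image work you need):

with h5py.File('myfile.hdf5', mode='r') as h5f:
    foo_ds = h5f['foo']
    for i in range(foo_ds.shape[0]):
        img = foo_ds[i]      # reads a single (3, 1080, 1920) array from disk
        process_image(img)   # hypothetical per-image processing step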
