
Read HDF5 file into numpy array

I have the following code to read an HDF5 file into a NumPy array:

import h5py
import numpy as np

hf = h5py.File('path/to/file', 'r')
n1 = hf.get('dataset_name')
n2 = np.array(n1)

and when I print n2 I get this:

Out[15]:
array([[<HDF5 object reference>, <HDF5 object reference>,
        <HDF5 object reference>, <HDF5 object reference>...

How can I read the HDF5 object reference to view the data stored in it?

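The easiest thing is to index the HDF5 dataset with an empty tuple, [()], which reads the whole dataset into memory. (Older h5py releases offered a .value attribute for this; it is deprecated and was removed in h5py 3.0.)

>>> hf = h5py.File('/path/to/file', 'r')
>>> data = hf['dataset_name'][()]  # `data` is now an ndarray.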

You can also slice the dataset, which produces an actual ndarray with the requested data:

>>> hf['dataset_name'][:10] # produces ndarray as well

But keep in mind that in many ways the h5py dataset acts like an ndarray, so you can pass the dataset itself unchanged to many NumPy functions. For example, this works just fine: np.mean(hf.get('dataset_name')).
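To make the three access patterns above concrete, here is a minimal sketch. It assumes a purely numeric dataset (not the reference case from the question), and the file name and contents are made up for illustration. The file is kept in memory with the "core" driver so nothing is written to disk.

```python
import numpy as np
import h5py

# Hypothetical in-memory file; "sketch_read.h5" and its contents are placeholders.
with h5py.File("sketch_read.h5", "w", driver="core", backing_store=False) as hf:
    hf.create_dataset("dataset_name", data=np.arange(20, dtype="f8"))
    dset = hf["dataset_name"]

    whole = dset[()]      # read the entire dataset into an ndarray
    first = dset[:10]     # read only the first ten elements
    mean = np.mean(dset)  # NumPy converts the dataset to an array internally

print(type(whole).__name__, whole.shape, first.shape, mean)
```

Slicing is usually the better choice for large datasets, since only the requested part is read from the file.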

EDIT:

I misunderstood the question originally. The problem isn't loading the numerical data, it's that the dataset actually contains HDF5 references. This is a strange setup, and it's kind of awkward to read in h5py. You need to dereference each reference in the dataset. I'll show it for just one of them.

First, let's create a file and a temporary dataset:

>>> f = h5py.File('tmp.h5', 'w')
>>> ds = f.create_dataset('data', data=np.zeros(10,))

Next, create a reference to it and store a few of them in a dataset.

>>> ref_dtype = h5py.special_dtype(ref=h5py.Reference)
>>> ref_ds = f.create_dataset('data_refs', data=(ds.ref, ds.ref), dtype=ref_dtype)

Then you can read one of these back, in a circuitous way, by getting its name and then reading from the dataset that is actually referenced.

>>> name = h5py.h5r.get_name(ref_ds[0], f.id) # 2nd argument is the file identifier
>>> print(name)
b'/data'
>>> out = f[name]
>>> print(out.shape)
(10,)

It's roundabout, but it seems to work. The TL;DR is: get the name of the referenced dataset, and read directly from that.

Note:

The h5py.h5r.dereference function seems pretty unhelpful here, despite the name. It returns the ID of the referenced object. This can be read from directly, but it's very easy to cause a crash in this case (I did it several times in this contrived example here). Getting the name and reading from that is much easier.

Note 2:

As stated in the release notes for h5py 2.1, the use of the Dataset.value property is deprecated and should be replaced by mydataset[...] or mydataset[()] as appropriate.

The property Dataset.value, which dates back to h5py 1.0, is deprecated and will be removed in a later release. This property dumps the entire dataset into a NumPy array. Code using .value should be updated to use NumPy indexing, using mydataset[...] or mydataset[()] as appropriate.
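As a rough illustration of the two replacement spellings, the sketch below reads an array dataset with [...] and a scalar dataset with both forms. The file and dataset names are invented for the example, and the file lives only in memory.

```python
import numpy as np
import h5py

# Hypothetical in-memory file with one 1D dataset and one scalar dataset.
with h5py.File("sketch_scalar.h5", "w", driver="core", backing_store=False) as hf:
    hf.create_dataset("vector", data=np.arange(4))
    hf.create_dataset("scalar", data=3.5)

    vector_all = hf["vector"][...]  # every element of the array dataset
    scalar_val = hf["scalar"][()]   # the scalar dataset read with an empty tuple
    scalar_arr = hf["scalar"][...]  # same value, read with an Ellipsis (shape ())

print(vector_all, scalar_val, scalar_arr.shape)
```

For a dataset with one or more dimensions the two forms should give the same array, so [()] is a reasonable default when the dimensionality is not known in advance.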

Here is a direct approach to read an HDF5 file into a NumPy array:

import numpy as np
import h5py

hf = h5py.File('path/to/file.h5', 'r')
# "dataset_name" is the name of the HDF5 object; slicing with [:] already returns an ndarray
n1 = hf["dataset_name"][:]

print(n1)

h5py provides a built-in method for such tasks: read_direct()

hf = h5py.File('path/to/file', 'r')
n1 = np.zeros(shape, dtype=numpy_type)
hf['dataset_name'].read_direct(n1)
hf.close()

The combined steps can still be faster than n1 = np.array(hf['dataset_name']) if you %timeit them. The only drawback is that the destination array must be allocated with the right shape beforehand. The shape can be stored as an attribute by the data provider, and it is also available as hf['dataset_name'].shape.
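A minimal sketch of read_direct() that takes the shape and dtype from the dataset itself rather than from prior knowledge. The file name and data are placeholders, and the file is held in memory only.

```python
import numpy as np
import h5py

# Hypothetical in-memory file; the name and contents are for illustration only.
with h5py.File("sketch_direct.h5", "w", driver="core", backing_store=False) as hf:
    hf.create_dataset("dataset_name", data=np.arange(12, dtype="i4").reshape(3, 4))
    dset = hf["dataset_name"]

    # Allocate a C-contiguous destination that matches the dataset.
    out = np.empty(dset.shape, dtype=dset.dtype)
    # Fill the preallocated array in place, without an intermediate copy.
    dset.read_direct(out)

print(out)
```

This is mostly useful when the same buffer is reused for many reads; for a one-off read, plain slicing is simpler.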

HDF5 has a simple object model for storing datasets (roughly speaking, the equivalent of an "on file array") and organizing those into groups (think of directories). On top of these two object types, there are much more powerful features that require layers of understanding.

The one at hand is a "Reference". It is an internal address in the storage model of HDF5.

h5py will do all the work for you without any calls to obscure routines, as it tries to follow as much as possible a dict-like interface (but for references, it is a bit more complex to make it transparent).

The place to look in the docs is "Object and Region References". It states that to access an object pointed to by a reference ref, you do

 my_object = my_file[ref]

In your problem, there are two steps: (1) get the reference, and (2) get the dataset it points to.

# Open the file
hf = h5py.File('path/to/file', 'r')
# Obtain the dataset of references
n1 = hf['dataset_name']
# Obtain the dataset pointed to by the first reference
ds = hf[n1[0]]
# Obtain the data in ds
data = ds[:]

If the dataset containing references is 2D, for instance, you must use

ds = hf[n1[0,0]]

If the dataset is scalar, you must use

data = ds[()]

To obtain all the datasets at once:

all_data = [hf[ref] for ref in n1[:]]

assuming a 1D dataset for n1. For 2D, the idea holds but I don't see a short way to write it.
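For the 2D case, one possible way to write it is a nested comprehension over the array of references. This is only a sketch: the dataset names, the 2x2 layout and the in-memory file are all made up for the example.

```python
import numpy as np
import h5py

ref_dtype = h5py.special_dtype(ref=h5py.Reference)

# Hypothetical in-memory file holding four small datasets and a 2x2 grid of references.
with h5py.File("sketch_refs2d.h5", "w", driver="core", backing_store=False) as hf:
    targets = [hf.create_dataset(f"d{i}", data=np.full(3, i)) for i in range(4)]
    refs = hf.create_dataset("refs", shape=(2, 2), dtype=ref_dtype)
    for k, target in enumerate(targets):
        refs[k // 2, k % 2] = target.ref

    # Read the whole grid of references, then dereference each one and load its data.
    ref_array = refs[()]
    all_data = [[hf[ref][()] for ref in row] for row in ref_array]

print(all_data)
```

The result is a nested list rather than a single array, because the referenced datasets are not guaranteed to share a shape.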

To give a full idea of how to round-trip data with references, I wrote a short "writer program" and a short "reader program":

import numpy as np
import h5py

# Open file                                                                                    
myfile = h5py.File('myfile.hdf5', 'w')

# Create dataset                                                                               
ds_0 = myfile.create_dataset('dataset_0', data=np.arange(10))
ds_1 = myfile.create_dataset('dataset_1', data=9-np.arange(10))

# Create a dataset of references
ref_dtype = h5py.special_dtype(ref=h5py.Reference)

ds_refs = myfile.create_dataset('ref_to_dataset', shape=(2,), dtype=ref_dtype)

ds_refs[0] = ds_0.ref
ds_refs[1] = ds_1.ref

myfile.close()

and

import numpy as np
import h5py

# Open file                                                                                    
myfile = h5py.File('myfile.hdf5', 'r')

# Read the references                                                                          
ref_to_ds_0 = myfile['ref_to_dataset'][0]
ref_to_ds_1 = myfile['ref_to_dataset'][1]

# Read the dataset                                                                             
ds_0 = myfile[ref_to_ds_0]
ds_1 = myfile[ref_to_ds_1]

# Read the value in the dataset                                                                
data_0 = ds_0[:]
data_1 = ds_1[:]

myfile.close()

print(data_0)
print(data_1)

You will notice that you cannot use the standard convenient and easy NumPy-like syntax for reference datasets. This is because HDF5 references are not representable with the native NumPy datatypes: reading the dataset gives you the references, and each one must then be dereferenced individually.
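If you do not know in advance whether a dataset holds numbers or references, one option is to check its dtype first. The helper below, read_resolved, is a hypothetical name and not part of h5py. It assumes that no reference in the dataset is null and that every reference points to a dataset.

```python
import numpy as np
import h5py

def read_resolved(h5file, name):
    """Return the data of `name`, following object references if it contains any."""
    dset = h5file[name]
    # check_dtype(ref=...) returns None for ordinary (non-reference) dtypes.
    if h5py.check_dtype(ref=dset.dtype) is None:
        return dset[()]
    # Reference dataset: dereference each entry and load the target's data.
    return [h5file[ref][()] for ref in dset[()].flat]

# Hypothetical in-memory file used only to exercise the helper.
with h5py.File("sketch_resolve.h5", "w", driver="core", backing_store=False) as hf:
    hf.create_dataset("plain", data=np.arange(5))
    a = hf.create_dataset("a", data=np.array([1, 2]))
    b = hf.create_dataset("b", data=np.array([3, 4, 5]))
    refs = hf.create_dataset("refs", shape=(2,),
                             dtype=h5py.special_dtype(ref=h5py.Reference))
    refs[0] = a.ref
    refs[1] = b.ref

    plain_data = read_resolved(hf, "plain")
    resolved = read_resolved(hf, "refs")

print(plain_data, resolved)
```

The helper flattens multi-dimensional reference datasets into a single list, which loses the original layout; keep the nested structure if the position of each reference matters.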

Hi, this is the way I read HDF5 data; I hope it is useful to you:

with h5py.File('name-of-file.h5', 'r') as hf:
    data = hf['name-of-dataset'][:]

I tried all the answers suggested previously, but none of them worked for me. For example, the read_direct() method gives the error 'Operation not defined for data type class', and .value does not work either. After a lot of struggling, I got around it by using the references themselves to build the NumPy array.

import numpy as np
import h5py
f = h5py.File('file.mat','r')
data2get = f.get('data2get')[:]

data = np.zeros([data2get.shape[1]])
for i in range(data2get.shape[1]):
    data[i]  = np.array(f[data2get[0][i]])[0][0]


 