
Reading and decoding hdf5 string with h5py

I have an HDF5 file that contains a string that I wish to read into Python (2) using the h5py package. In h5dump, the entry reads:

DATASET "Name" {
   DATATYPE  H5T_STRING {
      STRSIZE 5;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SCALAR
   DATA {
   (0): "L_1_3"
   }
}

I can access that file and extract the data as

import h5py
fp = h5py.File("myfile.hdf5","r")
Data=fp.get("Name")

Printing Data produces <HDF5 dataset "Name": shape (), type "|S5">. How do I extract the string?

My go-to solution of using np.array(Data) to decode it failed with the message IOError: Can't read data (no appropriate function for conversion path).

How about this:

import h5py
fp = h5py.File("myfile.hdf5","r")
Data = fp.get("Name")

and then:

print Data[0] # ?

Also, you may want to check len(Data) to see whether there is any data there.
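
One caveat, offered as a sketch rather than a confirmed fix: the h5dump output shows a SCALAR dataspace (shape ()), and h5py does not accept an integer index or len() on a scalar dataset. The usual way to read a scalar is an empty-tuple index, Data[()]. The helper names to_text and read_scalar_string below are made up for illustration, and the sketch assumes an h5py version that can read fixed-length UTF-8 strings (the conversion-path error in the question may mean the installed version could not).

```python
def to_text(value, encoding="utf-8"):
    """Decode a byte string (bytes or numpy.bytes_) to text; pass anything else through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def read_scalar_string(path, name):
    """Read a scalar string dataset and return it as text."""
    import h5py  # imported here so to_text() can be used without h5py installed

    with h5py.File(path, "r") as fp:
        # An empty tuple is the index for a scalar dataset (shape ()).
        return to_text(fp[name][()])


print(to_text(b"L_1_3"))
```

With the file from the question, this would be called as read_scalar_string("myfile.hdf5", "Name").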

I'm not familiar with HDF5 C++ syntax. It looks like dataset "Name" has a field named "L_1_3" with a String. There is an easier way to get a NumPy array (and better when working with large datasets). I think this will help you see how to work with HDF5 and h5py. When you print dtype you should see the name for each field. Also, I only use Python3, apologies if there are any syntax errors below.

import h5py
fp = h5py.File("myfile.hdf5","r")
Data = fp["Name"]  ## same as fp.get("Name")

# To work with the h5py dataset object
print Data.dtype, Data.shape
print Data[0]["L_1_3"] ## to get the first row from dataset

# To work with a NumPy array
Data_arr = fp["Name"][:] ## Adding [:] returns a NumPy array.
print Data_arr.dtype, Data_arr.shape
print Data_arr[0][0] ## to get the first row from NumPy array
# This notation might be required, depends on array dtype:
print Data_arr[0]["L_1_3"] ## to get the first row from NumPy array
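
Before indexing by field name, it may help to check whether the dataset actually has named fields. A minimal sketch, assuming only that the object exposes NumPy-style dtype and shape attributes (h5py datasets and NumPy arrays both do); the function names are hypothetical. For a plain string type such as |S5, dtype.names is None, so there is nothing to look up by name.

```python
def field_names(obj):
    """Return the compound field names of a dataset or array, or () if there are none."""
    names = obj.dtype.names  # None unless the dtype is compound (structured)
    return tuple(names) if names else ()


def first_item(obj):
    """Return the first element, using an empty-tuple index for scalar shapes."""
    index = () if obj.shape == () else 0
    return obj[index]
```

For the Data object above, print(field_names(Data)) would show whether "L_1_3" is a field name or the stored value.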

The code above outlines the basic steps to read data from an HDF5/h5py dataset. Here are additional considerations when working with an h5py dataset object versus a NumPy array. Some of this is explained in my response to a similar post. Link here: Answer to 61464832

It's easy to confuse h5py dataset objects and NumPy arrays. By design they behave similarly, but they are not the same. Both have a shape and a data type, support array-style slicing, and can be iterated over. The key difference: if you read a dataset into an array, you need enough memory to hold all of the data. When you access a dataset object, you do not read all of the data into memory. This is critical when you access huge datasets. In my example above, Data is a dataset object, and Data_arr is a NumPy array with the same data. Memory use doesn't matter with your small dataset. It makes a big difference if your dataset is large, say a (8000, 3, 1000, 2000) array of floats. That is 48 billion values, which requires 384 GB of memory at 8 bytes per value.
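
The arithmetic can be checked in a few lines. This sketch assumes 8-byte (float64) values.

```python
shape = (8000, 3, 1000, 2000)
bytes_per_value = 8  # assumption: float64

# Multiply the dimensions together to get the number of stored values.
n_values = 1
for dim in shape:
    n_values *= dim

n_bytes = n_values * bytes_per_value
print(n_values)       # 48000000000
print(n_bytes / 1e9)  # 384.0 (decimal GB)
```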

You can do many "array-like" operations without creating an array. The only time you really need an array is when you need to use a NumPy function that requires array inputs.

Here are some examples of working with the Data dataset in ways that resemble working with a NumPy array.

import h5py
fp = h5py.File("myfile.hdf5","r")
# iterate on rows in dataset "Name"
# Note how an array does not need to be created
# could also use 'Data' object from above: Data = fp["Name"]
for row in fp["Name"] :
    print row
# Slice the first row from the dataset
firstrow_arr = fp["Name"][0]
# Slice the last column from the dataset
lastcol_arr = fp["Name"][:,-1]
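
For a large dataset, the same slicing can be done one block of rows at a time, so only one block is held in memory at once. This is a sketch under that assumption: row_blocks, block_sums and block_rows are names invented here, and dset stands for any object with a shape that returns NumPy arrays when sliced, such as an h5py dataset.

```python
def row_blocks(n_rows, block_rows):
    """Yield slice objects that cover range(n_rows) in consecutive blocks."""
    for start in range(0, n_rows, block_rows):
        yield slice(start, min(start + block_rows, n_rows))


def block_sums(dset, block_rows=1000):
    """Sum a dataset along its first axis, reading one block of rows at a time."""
    total = 0
    for rows in row_blocks(dset.shape[0], block_rows):
        block = dset[rows]  # only this block is read into memory
        total = total + block.sum(axis=0)
    return total
```

The block size is a trade-off: larger blocks mean fewer reads, smaller blocks mean less memory in use at any moment.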
