
h5py: How to organize an HDF5 file to efficiently read objects with mixed data types

I'm currently working with Python 3.x and using the h5py library to write and read HDF5 files.

Let's suppose that I have a large number of elements containing properties of mixed data types. I want to store them in an HDF5 file so that single elements can be read as efficiently as possible, by index.

As an example, let's suppose that I have the following data:

item_1 = {'string_name': 'Paul', 'float_height': 5.9, 'int_age':27, 'numpy_data': np.array([5.4, 6.7, 8.8])}
item_2 = {'string_name': 'John', 'float_height': 5.7, 'int_age':31, 'numpy_data': np.array([3.1, 58.4, 66.4])}
...
item_1000000 = {'string_name': 'Anna', 'float_height': 6.1, 'int_age':33, 'numpy_data': np.array([4.7, 5.1, 4.2])}

The simplest solution I found is to collect each property into its own array, and then store each array separately inside the HDF5 file.

string_names = ['Paul', 'John', ... , 'Anna']
float_heights = [5.9, 5.7, ... , 6.1]
int_ages = [27, 31, ... , 33]
numpy_data = big_numpy_array_of_shape_1000000_by_3

Then, as an example, to retrieve the third element I must read the element at index "2" for each of the four arrays.
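For reference, the one-array-per-property layout described above might look like the following minimal sketch. The file name `columns.h5`, the dataset names, and the helper `read_item` are hypothetical, chosen only for illustration, and just three items are used.

```python
import os
import tempfile

import h5py
import numpy as np

string_names = ['Paul', 'John', 'Anna']
float_heights = [5.9, 5.7, 6.1]
int_ages = [27, 31, 33]
numpy_data = np.array([[5.4, 6.7, 8.8],
                       [3.1, 58.4, 66.4],
                       [4.7, 5.1, 4.2]])

# Hypothetical file location, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), 'columns.h5')

# One dataset per property
with h5py.File(path, 'w') as h5f:
    h5f.create_dataset('string_name', data=np.array(string_names, dtype='S50'))
    h5f.create_dataset('float_height', data=np.array(float_heights))
    h5f.create_dataset('int_age', data=np.array(int_ages))
    h5f.create_dataset('numpy_data', data=numpy_data)


def read_item(h5f, index):
    """Rebuild one item; this issues one read per dataset (four in total)."""
    return {
        'string_name': h5f['string_name'][index].decode(),
        'float_height': float(h5f['float_height'][index]),
        'int_age': int(h5f['int_age'][index]),
        'numpy_data': h5f['numpy_data'][index],
    }


with h5py.File(path, 'r') as h5f:
    third = read_item(h5f, 2)  # third element lives at index 2
```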

This solution works, but my guess is that it is very inefficient, because four read operations are needed to retrieve every single element.

Any suggestions?

As @hpaulj noted, the key is to create a record array (and/or a compound dtype) and reference it when you create your dataset. There are a lot of ways to load the data. I created an example using your list data (below) that shows the 2 easiest (IMHO). Read the reference documentation for all of the methods. I'm not sure whether you can load directly from a dictionary, but I'm sure it's possible with sufficient Python and NumPy magic.

import h5py
import numpy as np

string_names = ['Paul', 'John', 'Anna']
float_heights = [5.9, 5.7,  6.1]
int_ages = [27, 31, 33]
numpy_data = [ np.array([5.4, 6.7, 8.8]), 
               np.array([3.1, 58.4, 66.4]),
               np.array([4.7, 5.1, 4.2])  ] 

# Create empty record array with 3 rows
ds_dtype = [('name','S50'), ('height',float), ('ages',int), ('numpy_data', float, (3,) ) ]
ds_arr = np.recarray((3,),dtype=ds_dtype)
# load list data to record array by field name
ds_arr['name'] = np.asarray(string_names)
ds_arr['height'] = np.asarray(float_heights)
ds_arr['ages'] = np.asarray(int_ages)
ds_arr['numpy_data'] = np.asarray(numpy_data)

with h5py.File('SO_59483094.h5', 'w') as h5f:
    # Method 1: create dataset my_ds1 directly from the record array.
    # maxshape must be a tuple; (None,) makes the dataset resizable along axis 0.
    dset = h5f.create_dataset('my_ds1', data=ds_arr, maxshape=(None,))
    # Method 2: create an empty 100-row dataset my_ds2, then fill rows 0-2 by field name
    dset = h5f.create_dataset('my_ds2', dtype=ds_dtype, shape=(100,), maxshape=(None,))
    dset['name', 0:3] = np.asarray(string_names)
    dset['height', 0:3] = np.asarray(float_heights)
    dset['ages', 0:3] = np.asarray(int_ages)
    dset['numpy_data', 0:3] = np.asarray(numpy_data)
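
With a compound dataset like this, a whole element comes back from a single indexed read. The sketch below illustrates that, and it starts from a list of dictionaries shaped like the ones in the question, which is one possible way to handle the dictionary case. The file name `records.h5`, the dataset name `items`, and the variable names are assumptions made for this example only.

```python
import os
import tempfile

import h5py
import numpy as np

items = [
    {'string_name': 'Paul', 'float_height': 5.9, 'int_age': 27,
     'numpy_data': np.array([5.4, 6.7, 8.8])},
    {'string_name': 'John', 'float_height': 5.7, 'int_age': 31,
     'numpy_data': np.array([3.1, 58.4, 66.4])},
    {'string_name': 'Anna', 'float_height': 6.1, 'int_age': 33,
     'numpy_data': np.array([4.7, 5.1, 4.2])},
]

# Same field layout as the answer's ds_dtype, with explicit sizes
ds_dtype = np.dtype([('name', 'S50'), ('height', 'f8'),
                     ('ages', 'i8'), ('numpy_data', 'f8', (3,))])

# Fill a structured array field by field from the dictionaries
ds_arr = np.zeros(len(items), dtype=ds_dtype)
ds_arr['name'] = [it['string_name'].encode() for it in items]
ds_arr['height'] = [it['float_height'] for it in items]
ds_arr['ages'] = [it['int_age'] for it in items]
ds_arr['numpy_data'] = np.stack([it['numpy_data'] for it in items])

# Hypothetical file location, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), 'records.h5')

with h5py.File(path, 'w') as h5f:
    h5f.create_dataset('items', data=ds_arr, maxshape=(None,))

with h5py.File(path, 'r') as h5f:
    record = h5f['items'][2]  # one read returns all four fields of row 2

# Convert the record back into a dictionary like the original item
third = {
    'string_name': record['name'].decode(),
    'float_height': float(record['height']),
    'int_age': int(record['ages']),
    'numpy_data': record['numpy_data'],
}
```

Whether this is actually faster than four separate reads depends on chunking and caching, so it is worth timing both layouts on the real data.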
