
Loading only a part of the data in an HDF5 file into memory with Python

To load data from an HDF5 file into memory, one can use the pandas.read_hdf function with a list of columns to load. However, this way the entire table is first loaded into memory and the unwanted columns are then dropped, so the initial memory usage is much larger than the actual size of the data of interest.

Is there a way to load only the columns of interest?
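For reference, a minimal sketch of the call described above (the file name, key and column names are hypothetical; columns= generally requires the data to have been stored in pandas' 'table' format):

import pandas as pd

# columns= asks read_hdf for a subset of columns, but the memory footprint
# while reading can still be much larger than the returned DataFrame,
# which is the concern raised in the question.
df = pd.read_hdf('my_data.h5', key='my_table', columns=['i_arr', 'x_arr'])
print(df.head())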

Nownuri, both pytables and h5py offer methods to read part of the file.
With pytables, there are several methods to read a table into a numpy array. These include:

  • table.read() lets you slice the data,
  • table.read_coordinates() reads a set of (non-consecutive) coordinates (aka rows),
  • table.read_where() reads a set of rows based on a search condition.

All support an optional field='' parameter to read a single column of data based on the field name (like a numpy recarray); a short sketch follows below. For complete details, read the PyTables documentation. You can find it here: PyTables User Guide
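To make the difference between these calls concrete, here is a minimal sketch, assuming the table file created in the pytables example further down (a /ds1 table with columns i_arr and x_arr):

import tables as tb

with tb.File('SO_57342918_tb.h5', 'r') as h5f:
    table = h5f.root.ds1
    # slice rows 2..4 of one column
    part = table.read(start=2, stop=5, field='x_arr')
    # read specific (non-consecutive) rows of one column
    picked = table.read_coordinates([0, 3, 7], field='i_arr')
    # read rows matching a condition, returning only one column
    hits = table.read_where('x_arr > 5.0', field='i_arr')
    print(part, picked, hits)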

h5py has similar (but different) methods based on numpy array slicing conventions. For h5py details, access the documentation here: H5py Documentation

Below are very simple (self-contained) examples for each. I create the data in write mode, then reopen the file in read mode. You probably only need the second half of each example (how to read the data). Also, HDF5 files are independent of the creation method: you could read either HDF5 file with h5py or pytables, regardless of how it was created (a short cross-read example is shown at the end).

Pytables method:
This method shows two different ways to access a table with pytables. The first uses 'Natural Naming' to get h5_i_arr, and the second uses the get_node() method to read h5_x_arr.

import tables as tb
import numpy as np

with tb.File('SO_57342918_tb.h5','w') as h5f:

    i_arr=np.arange(10)
    x_arr=np.arange(10.0)

    my_dt = np.dtype([ ('i_arr', int), ('x_arr', float) ] )
    table_arr = np.recarray( (10,), dtype=my_dt )
    table_arr['i_arr'] = i_arr
    table_arr['x_arr'] = x_arr

    my_ds = h5f.create_table('/','ds1',obj=table_arr)

# read 1 column using field= parameter:   
with tb.File('SO_57342918_tb.h5','r') as h5f:

    h5_i_arr = h5f.root.ds1.read(field='i_arr')
    h5_x_arr = h5f.get_node('/ds1').read(field='x_arr')
    print (h5_i_arr)
    print (h5_x_arr)
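As a side note (not part of the two approaches above), PyTables also has a Table.col() convenience method that pulls one complete column by name; a minimal sketch using the same file:

with tb.File('SO_57342918_tb.h5', 'r') as h5f:
    # Table.col() returns a whole column as a numpy array
    h5_i_arr = h5f.root.ds1.col('i_arr')
    print(h5_i_arr)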

h5py method:

import h5py
import numpy as np

with h5py.File('SO_57342918_h5py.h5','w') as h5f:

    i_arr=np.arange(10)
    x_arr=np.arange(10.0)

    my_dt = np.dtype([ ('i_arr', int), ('x_arr', float) ] )
    table_arr = np.recarray( (10,), dtype=my_dt )
    table_arr['i_arr'] = i_arr
    table_arr['x_arr'] = x_arr

    my_ds = h5f.create_dataset('/ds1',data=table_arr)

# read 1 column using numpy slicing: 
with h5py.File('SO_57342918_h5py.h5','r') as h5f:

    h5_i_arr = h5f['ds1'][:,'i_arr']
    h5_x_arr = h5f['ds1'][:,'x_arr']
    print (h5_i_arr)
    print (h5_x_arr)
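And to illustrate the earlier point that the files are interchangeable, here is a minimal sketch reading the pytables-created file with h5py, combining a row slice with a field name:

import h5py

with h5py.File('SO_57342918_tb.h5', 'r') as h5f:
    # rows 2..4 of a single field from the table created with pytables
    part = h5f['ds1'][2:5, 'x_arr']
    print(part)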
