Reading and writing numpy arrays to and from HDF5 files

I am building simulation software, and I need to write (thousands of) 2D numpy arrays into tables in an HDF5 file, where one dimension of the array is variable. The incoming array is of float32 type; to save disk space, every array is stored as a table with appropriate data types for the columns (hence not using arrays). When I read the tables, I'd like to retrieve a numpy.ndarray of type float32, so I can do calculations for analysis. Below is example code with an array with species A, B, and C plus time.

The way I am currently reading and writing 'works', but it is very slow. The question is thus: what is the appropriate way of storing an array into a table fast, and also of reading it back into ndarrays? I have been experimenting with numpy.recarray, but I cannot get this to work (type errors, dimension errors, wholly wrong numbers, etc.).

Code:

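import tables as pt
import numpy as np

# Variable dimension (example value; an assumption, the assignment is not shown)
var_dim = 100

# Example array, rows 0 and 3 should be stored as float32, rows 1 and 2 as uint16
array = (np.random.random((4, var_dim)) * 100).astype(dtype=np.float32)

# Assumed setup: file name taken from the traceback below; the group name,
# column names and dtypes list are inferred from the rest of the code
filename = 'test.hdf5'
hdf = pt.open_file(filename, mode='w')
group = hdf.create_group(hdf.root, 'group')
particle = {
    'A': pt.Float32Col(),
    'B': pt.UInt16Col(),
    'C': pt.UInt16Col(),
    'time': pt.Float32Col(),
}
dtypes = [np.float32, np.uint16, np.uint16, np.float32]

# This is the table to be stored in
table = hdf.create_table(group, 'trajectory', description=particle, expectedrows=var_dim)

# My current way of storing
for i, row in enumerate(array.T):
    table.append([tuple([t(x) for t, x in zip(dtypes, row)])])
table.flush()
hdf.close()

# Reopen the file for reading (assumed)
hdf = pt.open_file(filename, mode='r')
array_table = hdf.root.group.trajectory

# My current way of reading
row_list = []
for i, row in enumerate(array_table.read()):
    row_list.append(np.array(list(row)))

# The retrieved array
array = np.asarray(row_list).T

# I've tried something with a recarray (assumed form of the attempt)
rec_array = array_table.read().view(type=np.recarray)

# This gives me errors, or wrong results (assumed form of the failing call)
rec_array.view(dtype=np.float64)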

The error I get:

Traceback (most recent call last):
  File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 475, in __setattr__
    ret = object.__setattr__(self, attr, val)
ValueError: new type not compatible with array.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/thomas/Documents/Thesis/SO.py", line 53, in <module>
  File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 480, in __setattr__
    raise exctype(value)
ValueError: new type not compatible with array.
Closing remaining open files:test.hdf5...done

As a quick and dirty solution, it is possible to avoid loops by temporarily converting the arrays to lists (if you can spare the memory). For some reason, record arrays are readily converted to and from lists, but not to and from conventional arrays.

Storing:
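A minimal sketch of the list-based storing step, assuming the column layout from the question (float32, uint16, uint16, float32) and hypothetical field names. The 2D array is turned into a list of row tuples and then into a structured array in one call. PyTables' `Table.append` accepts a structured array whose dtype matches the table, so the result could be appended in a single call instead of once per row:

```python
import numpy as np

# Assumed column layout (from the question's comment); field names are hypothetical.
row_dtype = np.dtype([('A', np.float32), ('B', np.uint16),
                      ('C', np.uint16), ('time', np.float32)])

var_dim = 5  # small example size
array = (np.random.random((4, var_dim)) * 100).astype(np.float32)

# One tuple per table row. np.array needs tuples (not lists) to fill
# the fields of a structured dtype.
rows = [tuple(r) for r in array.T.tolist()]
records = np.array(rows, dtype=row_dtype)

# With an open PyTables table of matching description, the write
# would then be: table.append(records)
```

The uint16 fields truncate the float values, which matches what the per-element casts in the question do.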


Loading:

loaded_array = np.array(array_table.read().tolist(), dtype=np.float64).T

There should be a more "Numpythonic" approach to converting between record arrays and conventional arrays, but I'm not familiar enough with the former to know how.
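One candidate for such a conversion, which is not part of the original answer: NumPy 1.16 and later provide `numpy.lib.recfunctions.structured_to_unstructured`, which copies the fields of a structured array into a regular 2D array. A minimal sketch, using a small structured array `arr` as a stand-in for `array_table.read()`:

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

dt = np.dtype('f4,u2,u2,f4')
arr = np.ones(3, dtype=dt)  # stand-in for array_table.read()

# One row per record, one column per field, cast to float32.
# Transpose so that each field becomes a row, as in the question's array.
loaded_array = structured_to_unstructured(arr, dtype=np.float32).T
```

This avoids the round trip through Python lists, at the cost of requiring a reasonably recent NumPy.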

I haven't worked with tables, but have looked at its files with h5py. I'm guessing then that your array or recarray is a structured array with a dtype like:

In [131]: dt=np.dtype('f4,u2,u2,f4')
In [132]: arr = np.ones(3, dtype=dt)
In [133]: arr
array([( 1., 1, 1,  1.), ( 1., 1, 1,  1.), ( 1., 1, 1,  1.)], 
      dtype=[('f0', '<f4'), ('f1', '<u2'), ('f2', '<u2'), ('f3', '<f4')])

Using @kazemakase's tolist approach (which I've recommended in other posts):

In [134]: np.array(arr.tolist(), float)
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

astype gets the shape all wrong:

In [135]: arr.astype(np.float32)
Out[135]: array([ 1.,  1.,  1.], dtype=float32)

view works when the component dtypes are uniform, for example with the 2 float fields:

In [136]: arr[['f0','f3']].copy().view(np.float32)
Out[136]: array([ 1.,  1.,  1.,  1.,  1.,  1.], dtype=float32)

But it does require a reshape. view only reinterprets the bytes of the data buffer.
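A sketch of the reshape step. One assumption to flag: on NumPy 1.16 and later, multi-field indexing returns a view that keeps the original field offsets (padding included), so this example first packs the selected fields with `numpy.lib.recfunctions.repack_fields` before reinterpreting the bytes as float32. The values in `f3` are made up to make the columns distinguishable:

```python
import numpy as np
from numpy.lib.recfunctions import repack_fields

dt = np.dtype('f4,u2,u2,f4')
arr = np.ones(3, dtype=dt)
arr['f3'] = [10., 20., 30.]  # illustrative values, so the two fields differ

# Select the two float32 fields and pack them, so each record is exactly 8 bytes.
packed = repack_fields(arr[['f0', 'f3']])

# Reinterpret the buffer as a flat float32 array, then restore one row per record.
floats = packed.view(np.float32).reshape(-1, 2)
```

The reshape is needed because the view of a 1D structured array is itself 1D: the record boundaries are lost once the bytes are read as plain floats.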

Many recfunctions functions use a field-by-field copy. Here the equivalent would be:

In [138]: res = np.empty((3,4),'float32')
In [139]: for i in range(4):
     ...:     res[:,i] = arr[arr.dtype.names[i]]
In [140]: res
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]], dtype=float32)

If the number of fields is small compared to the number of records, this iteration is not expensive.

def foo(arr):
    res = np.empty((arr.shape[0],4), np.float32)
    for i in range(4):
        res[:,i] = arr[arr.dtype.names[i]]
    return res

With a large 4-field array, the by-field copy is clearly faster:

In [143]: arr = np.ones(10000, dtype=dt)
In [149]: timeit x1 = foo(arr)
10000 loops, best of 3: 73.5 µs per loop
In [150]: timeit x2 = np.array(arr.tolist(), np.float32)
100 loops, best of 3: 11.9 ms per loop
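Applied to the question's table, the by-field copy could be used in both directions. The following is a sketch under stated assumptions, not a tested recipe for the asker's data: the field names, the temporary file location, and the helper names `to_records` and `to_ndarray` are all hypothetical, and it relies on `create_table` accepting a NumPy dtype as the description and on `Table.append` accepting a matching structured array.

```python
import os
import tempfile

import numpy as np
import tables as pt

# Column layout from the question's comment; the field names are assumptions.
row_dtype = np.dtype([('A', np.float32), ('B', np.uint16),
                      ('C', np.uint16), ('time', np.float32)])


def to_records(array):
    """Pack a (n_fields, n) float array into an (n,) structured array."""
    records = np.empty(array.shape[1], dtype=row_dtype)
    for i, name in enumerate(row_dtype.names):
        records[name] = array[i]  # assignment casts to the field's dtype
    return records


def to_ndarray(records):
    """Unpack an (n,) structured array into an (n_fields, n) float32 array."""
    names = records.dtype.names
    out = np.empty((len(names), records.shape[0]), dtype=np.float32)
    for i, name in enumerate(names):
        out[i] = records[name]
    return out


array = (np.random.random((4, 100)) * 100).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), 'test.hdf5')

# Write all rows with a single append call.
with pt.open_file(path, mode='w') as hdf:
    group = hdf.create_group(hdf.root, 'group')
    table = hdf.create_table(group, 'trajectory', description=row_dtype,
                             expectedrows=array.shape[1])
    table.append(to_records(array))

# Read the whole table back and convert it to a plain float32 array.
with pt.open_file(path, mode='r') as hdf:
    loaded = to_ndarray(hdf.root.group.trajectory.read())
```

The loops run over the four fields rather than over the rows, so their cost stays small even when the variable dimension is large.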

