
Fast and efficient way of serializing and retrieving a large number of numpy arrays from HDF5 file

I have a huge list of numpy arrays, specifically 113287, where each array is of shape 36 x 2048. In terms of memory, this amounts to 32 gigabytes.

As of now, I have serialized these arrays as a giant HDF5 file. Now, the problem is that retrieving individual arrays from this hdf5 file takes an excruciatingly long time (north of 10 minutes) for each access.

How can I speed this up? This is very important for my implementation since I have to index into this list several thousand times for feeding into Deep Neural Networks.

Here's how I index into the hdf5 file:

In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')

In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'

In [6]: group_key = list(hf.keys())[0]

In [7]: hf[group_key]
Out[7]: <HDF5 dataset "img_feats": shape (113287, 36, 2048), type "<f4">


# this is where it takes very very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)

Any ideas where I can speed things up? Is there any other way of serializing these arrays for faster access?

Note: I'm using a Python list since I want the order to be preserved (i.e. to retrieve the arrays in the same order as I put them in when I created the hdf5 file).

One way would be to put each sample into its own group and index directly into those. I am thinking the conversion takes so long because it tries to load the entire data set into a list (which it has to read from disk). Re-organizing the h5 file such that

  • group
    • sample
      • 36 x 2048

may help in indexing speed, as sketched below.
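A minimal sketch of that layout (the grouped file name, the small sample count, and the random stand-in data are made up for illustration):

import numpy as np
import h5py

# Hypothetical example: store each (36, 2048) sample as its own dataset inside
# one group, so a single sample can be read without touching the others.
n_samples = 10   # the real data has 113287 samples; kept small here
with h5py.File('train_ids_grouped.hdf5', 'w') as hf:
    grp = hf.create_group('img_feats')
    for i in range(n_samples):
        sample = np.random.rand(36, 2048).astype('float32')   # stand-in data
        grp.create_dataset(str(i), data=sample)

# Read back a single sample by its key; the original insertion order is
# recovered through the integer key.
with h5py.File('train_ids_grouped.hdf5', 'r') as hf:
    x = hf['img_feats']['3'][()]   # loads only that (36, 2048) array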

According to Out[7], "img_feats" is a large 3d array of shape (113287, 36, 2048).

Define ds as the dataset (doesn't load anything):

ds = hf[group_key]

x = ds[0]    # should be a (36, 2048) array

arr = ds[:]   # should load the whole dataset into memory.
arr = ds[:n]   # load a subset, slice 

According to h5py-reading-writing-data:

HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file.

I don't see any point in wrapping that in list(); that is, in splitting the 3d array into a list of 113287 2d arrays. There's a clean mapping between 3d datasets in the HDF5 file and numpy arrays.
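For example (a sketch reusing the file name and key from the question), the slow list(hf[group_key])[-1] can be replaced with a direct index, and ordered batches can be read with plain slices:

import h5py

hf = h5py.File('train_ids.hdf5', 'r')
ds = hf['img_feats']        # dataset handle, nothing loaded yet

last = ds[-1]               # reads only the last (36, 2048) array from disk
batch = ds[0:32]            # reads 32 consecutive samples, in the original order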

h5py-fancy-indexing warns that fancy indexing of a dataset is slower. That is, seeking to load, say, subarrays [1, 1000, 3000, 6000] of that large dataset.
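Continuing the sketch above, a fancy-indexing read looks like this (the h5py docs note that the index list must be in increasing order and that long lists are slow):

idx = [1, 1000, 3000, 6000]      # must be increasing, per the h5py docs
sub = ds[idx]                    # shape (4, 36, 2048); slower than a contiguous slice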

You might want to experiment with writing and reading some smaller datasets if working with this large one is too confusing.
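A sketch of such an experiment (the file name and the 100-sample size are arbitrary):

import numpy as np
import h5py

small = np.random.rand(100, 36, 2048).astype('float32')   # roughly 29 MB
with h5py.File('small_test.hdf5', 'w') as hf:
    hf.create_dataset('img_feats', data=small)

with h5py.File('small_test.hdf5', 'r') as hf:
    ds = hf['img_feats']
    print(ds[5].shape)     # (36, 2048), read without loading the whole file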
