
Memory error while reading a large .h5 file

I have created a .h5 file from a numpy array:

import h5py

h5f = h5py.File('/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5', 'w')
h5f.create_dataset('JZ3WPpxpypz', data=all, compression="gzip")  # `all` is the source numpy array

HDF5 dataset "JZ3WPpxpypz": shape (19494500, 376), type "f8"

But I am getting a memory error while reading the .h5 file into a numpy array:

filename = '/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5'
h5 = h5py.File(filename,'r')

h5.keys()

[u'JZ3WPpxpypz']

data = h5['JZ3WPpxpypz']

If I try to view the array, it gives me a memory error:

data[:]

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-33-629f56f97409> in <module>()
----> 1 data[:]

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

/home/debo/env_autoencoder/local/lib/python2.7/site-packages/h5py/_hl/dataset.pyc in __getitem__(self, args)
    560         single_element = selection.mshape == ()
    561         mshape = (1,) if single_element else selection.mshape
--> 562         arr = numpy.ndarray(mshape, new_dtype, order='C')
    563 
    564         # HDF5 has a bug where if the memory shape has a different rank

MemoryError: 

Is there any memory-efficient way to read a .h5 file into a numpy array?

Thanks, Debo.

You don't need to call numpy.ndarray() to get an array. Try this:

arr = h5['JZ3WPpxpypz'][:]
# or
arr = data[:]

Adding [:] returns the array (different from your data variable, which simply references the HDF5 dataset). Either method should give you an array with the same dtype and shape as the original. You can also use numpy slicing operations to get subsets of the array.
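
For example, a minimal sketch (dataset name taken from above; the row and column counts are illustrative assumptions) that pulls only a small slice into memory:

subset = h5['JZ3WPpxpypz'][:1000, :10]  # first 1000 rows, first 10 columns
print(subset.shape)  # (1000, 10)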

A clarification is in order. I overlooked that numpy.ndarray() was called as part of the process to print data[:]. Here are type checks to show the difference in the returns from the 2 calls:

# check type for each variable:
data = h5['JZ3WPpxpypz']
print (type(data))
# versus
arr = data[:]
print (type(arr))

Output will look like this:

<class 'h5py._hl.dataset.Dataset'>
<class 'numpy.ndarray'>

In general, h5py dataset behavior is similar to numpy arrays (by design). However, they are not the same. When you tried to print the dataset contents with this call (data[:]), h5py tried to convert the dataset to a numpy array in the background with numpy.ndarray(). It would have worked if you had a smaller dataset or sufficient memory.
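
One consequence of that design is that you can inspect the dataset's metadata without reading any of the data; a small sketch, assuming the names used above:

data = h5['JZ3WPpxpypz']
print(data.shape)  # (19494500, 376) -- nothing is read from disk yet
print(data.dtype)  # float64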

My takeaway: arr = h5['JZ3WPpxpypz'][:] still has to materialize the full array in memory, so for a dataset this size (19494500 × 376 float64 values, roughly 58 GB) it can fail the same way. When that happens, read subsets instead.

When you have very large datasets, you may run into situations where you can't create an array with arr = h5f['dataset'][:] because the dataset is too large to fit into memory as a numpy array. When this occurs, you can create the h5py dataset object, then access subsets of the data with slicing notation, like this trivial example:

data = h5['JZ3WPpxpypz']
arr1 = data[0:100000]       # rows 0-99999 as a numpy array
arr2 = data[100000:200000]  # rows 100000-199999
# etc
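
Extending that idea, here is a minimal sketch of processing the entire dataset in fixed-size chunks, so that only one chunk is ever held in memory. The chunk size and the per-chunk column sum are illustrative assumptions, not part of the original answer:

import h5py
import numpy as np

filename = '/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5'
chunk_rows = 100000  # assumed chunk size; tune it to your available RAM

with h5py.File(filename, 'r') as h5:
    data = h5['JZ3WPpxpypz']
    total = np.zeros(data.shape[1], dtype=data.dtype)
    for start in range(0, data.shape[0], chunk_rows):
        chunk = data[start:start + chunk_rows]  # only this slice is loaded
        total += chunk.sum(axis=0)              # example per-chunk reduction

print(total.shape)  # (376,)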
