Loading speed vs memory: how to efficiently load large arrays from h5 file

I have been facing the following issue: I have to loop over num_objects = 897 objects, and for every single one of them I have to use num_files = 2120 h5 files. The files are very large, each being 1.48 GB, and the content of interest to me is 3 arrays of floats of size 256 x 256 x 256 contained in every file (v1, v2 and v3). That is to say, the loop looks like:

for i in range(num_objects):
    ...
    for j in range(num_files):
        # some operation with the three 256 x 256 x 256 arrays in each file

and my current way of loading them is by doing the following in the innermost loop:

f = h5py.File('output_'+str(q)+'.h5','r')
key1 = np.array(f['key1'])
v1=key1[:,:,:,0]
v2=key1[:,:,:,1]
v3=key1[:,:,:,2]

The above option of loading the files every time for every object is obviously very slow. On the other hand, loading all files at once and importing them into a dictionary results in excessive memory use, and my job gets killed. Some diagnostics:

  • The above way takes 0.48 seconds per file, per object, resulting in a total of 10.5 days (!) spent only on this operation.
  • I tried exporting key1 to npz files, but it was actually slower by 0.7 seconds per file.
  • I exported v1, v2 and v3 individually for every file to npz files (i.e. 3 npz files for each h5 file), but that saved me only 1.5 days in total.

Does anyone have some other idea/suggestion I could try, so as to be fast and at the same time not limited by excessive memory usage?

If I understand, you have 2120 .h5 files. Do you only read the 3 arrays in dataset f['key1'] for each file, or are there other datasets? If you only/always read f['key1'], that's a bottleneck you can't program around. Using an SSD will help (because I/O is faster than on an HDD). Otherwise, you will have to reorganize your data. The amount of RAM on your system will determine the number of arrays you can read simultaneously. How much RAM do you have?
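
For a rough sense of the scale involved, here is a back-of-the-envelope sketch, assuming the data is stored as float32 (double everything for float64):

# Approximate memory cost per file, assuming 4-byte float32 values.
bytes_per_array = 256 * 256 * 256 * 4      # one 256^3 array  -> 64 MiB
bytes_per_file = 3 * bytes_per_array       # v1, v2 and v3    -> 192 MiB
print(bytes_per_array / 2**20, bytes_per_file / 2**20)
# Holding all 2120 files in RAM would need roughly 400 GB (float32)
# or 800 GB (float64), which is why loading everything into a
# dictionary gets the job killed.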

You might gain a little speed with a small code change. v1=key1[:,:,:,0] returns v1 as an array (same for v2 and v3). There's no need to read dataset f['key1'] into an array first; doing that doubles your memory footprint. (BTW, is there a reason to convert your arrays to a dictionary?)

The process below only creates the 3 arrays, by slicing v1, v2, v3 directly from the h5py f['key1'] object. It will reduce your memory footprint for each loop by 50%.

f = h5py.File('output_'+str(q)+'.h5','r')
key1 = f['key1'] 
## key1 is returned as a h5py dataset OBJECT, not an array
v1=key1[:,:,:,0]
v2=key1[:,:,:,1]
v3=key1[:,:,:,2]

On the HDF5 side, since you always slice out the last axis, changing your chunk parameters might improve I/O. However, if you want to change the chunk shape, you will have to recreate the .h5 files. So, that will probably not save time (at least in the short term).
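
If you do decide to reorganize, a minimal sketch of a one-off rechunking pass might look like this (hypothetical function and file names; it assumes f['key1'] has shape (256, 256, 256, 3) as described, and the chunk shape is only a starting point to tune):

import h5py

def rechunk(src_path, dst_path):
    # Rewrite 'key1' with chunks that never span the last axis,
    # so reading v1, v2 or v3 touches only the chunks it needs.
    with h5py.File(src_path, 'r') as src, h5py.File(dst_path, 'w') as dst:
        data = src['key1'][...]                     # shape (256, 256, 256, 3)
        dst.create_dataset('key1', data=data,
                           chunks=(64, 64, 64, 1),  # chunk shape is a guess to tune
                           compression='gzip')      # optional: trades CPU for I/O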

Agreed with @kcw78: in my experience the bottleneck is usually that you load too much data compared to what you actually need. So don't convert the dataset to an array unless you need to, and slice only the part you need (do you need the whole [:,:,:,0] or just a subset of it?).

Also, if you can, change the order of the loops so that you open each file only once:

for j in range(num_files):
    ...
    for i in range(num_objects):
        # some operation with the three 256 x 256 x 256 arrays in each file

Another approach would be to use an external tool (like h5copy) to extract the data you need into a much smaller dataset, and then read it in python, to avoid python's overhead (which might not be that much, to be honest).
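
As a sketch of what that pre-processing could look like from Python instead of an external tool (hypothetical function and file names; assumes the (256, 256, 256, 3) layout above), splitting key1 into three separate datasets so each later read is a single plain read:

import h5py

def split_file(src_path, dst_path):
    # One-off extraction: store v1, v2, v3 as their own datasets so the
    # main loop can read each one directly instead of slicing 'key1'.
    with h5py.File(src_path, 'r') as src, h5py.File(dst_path, 'w') as dst:
        key1 = src['key1']                  # dataset object, nothing loaded yet
        for idx, name in enumerate(('v1', 'v2', 'v3')):
            dst.create_dataset(name, data=key1[:, :, :, idx])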

Finally, you could use multiprocessing to take advantage of your CPU cores, as long as your tasks are relatively independent from one another. You have an example here.
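
A minimal sketch of that idea (the file naming follows the question's snippet; process_file is a placeholder for whatever the inner loop actually computes, and the worker count is something to tune against your RAM):

import h5py
from multiprocessing import Pool

def process_file(j):
    # Placeholder per-file work: open one file, slice v1, v2, v3,
    # and return some reduced result instead of the full arrays.
    with h5py.File('output_' + str(j) + '.h5', 'r') as f:
        key1 = f['key1']
        v1, v2, v3 = key1[:, :, :, 0], key1[:, :, :, 1], key1[:, :, :, 2]
        return float(v1.mean() + v2.mean() + v3.mean())   # dummy reduction

if __name__ == '__main__':
    num_files = 2120
    with Pool(processes=4) as pool:   # each worker holds a few hundred MB
        results = pool.map(process_file, range(num_files))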

h5py is the data representation class that supports most NumPy data types and traditional NumPy operations like slicing, along with a variety of descriptive attributes such as shape and size.

There is a cool option that you can use for representing datasets in chunks, using the create_dataset method with the chunks keyword:

dset = f.create_dataset("chunked", (1000, 1000), chunks=(100, 100)) dset = f.create_dataset("分块", (1000, 1000), 分块=(100, 100))

The benefit of this is that you can easily resize your dataset and, as in your case, you would only have to read a chunk and not the whole 1.48 GB of data.
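
For example (continuing with the hypothetical dset created above), reading a small slice only pulls in the chunks that overlap it:

block = dset[0:100, 0:100]    # reads just one 100 x 100 chunk, not the full dataset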

But beware of chunking implications: different data sizes work best with different chunk sizes. There is an autochunk option that selects the chunk size automatically rather than through trial and error:

dset = f.create_dataset("autochunk", (1000, 1000), chunks=True) dset = f.create_dataset("autochunk", (1000, 1000), chunks=True)

Cheers
