
Fast slicing .h5 files using h5py

I am working with .h5 files, with which I have little experience.

In a script I wrote, I load data from an .h5 file. The shape of the resulting array is [3584, 3584, 75]. Here the value 3584 denotes the number of pixels, and 75 denotes the number of time frames. Loading the data and printing the shape takes 180 ms. I obtained this timing using os.times().

If I now want to look at the data at a specific time frame, I use the following piece of code:

data_1 = data[:, :, 1]

The slicing takes a lot of time (1.76 s). I understand that my 2D array is huge, but at some point I would like to loop over time, and since I'm performing this slice inside the for loop, that will take very long.

Is there a more effective/less time-consuming way of slicing the time frames or handling this type of data?

Thank you!

Note: I'm making assumptions here, since I'm unfamiliar with .h5 files and the Python code that accesses them.

I think that what is happening is that when you "load" the array, you're not actually loading an array. Instead, I think an object is constructed on top of the file. It probably reads the dimensions and information related to how the file is organized, but it doesn't read the whole file.

That object mimics an array so well that when you later perform the slice operation, the normal Python slice syntax can be used, but only at that point is the actual data read. That's why the slice takes so long compared to "loading" all the data.
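A minimal sketch of this behavior with h5py (the file and dataset names here are made up for the demo, and a tiny array stands in for the real 3584 x 3584 x 75 data):

```python
import numpy as np
import h5py

# Create a small stand-in .h5 file for the demo.
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("data", data=np.zeros((64, 64, 5), dtype=np.uint8))

f = h5py.File("demo.h5", "r")
data = f["data"]        # a lazy h5py Dataset, NOT a NumPy array:
                        # only shape/dtype metadata has been read so far
print(type(data).__name__, data.shape)

frame = data[:, :, 1]   # the file is actually read here, on every slice

# If the whole array fits in RAM, read it once up front; all later
# slices are then ordinary in-memory NumPy operations.
arr = data[...]
frame = arr[:, :, 1]
f.close()
```

The one-time `data[...]` read costs roughly as much as one slow slice pass over the file, but it is paid once rather than once per frame inside the loop.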

I arrive at this conclusion for the following reasons.

If you're reading 75 frames of 3584x3584 pixels, I'm assuming they're uncompressed (H5 seems to be just raw dumps of data), and in that case 75 * 3584 * 3584 = 963,379,200, which is around 918 MB of data. Couple that with you "reading" this in 180 ms, and we get this calculation:

918 MB / 180 ms = 5.1 GB/second reading speed

Note: this number is for 1-byte pixels, which is also unlikely.

This speed thus seems highly unlikely, as even the best SSDs today reach way below 1 GB/sec.

It seems much more plausible that an object is just constructed on top of the file, and that the slice operation incurs the cost of reading at least one frame's worth of data.

If we divide the speed by 75 to get the per-frame speed, we get 68 MB/sec for 1-byte pixels, and with 24- or 32-bit pixels we get up to 270 MB/sec reading speeds. Much more plausible.
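These back-of-the-envelope numbers can be reproduced directly (using MiB, i.e. 2^20 bytes, to match the ~918 MB figure above):

```python
# Back-of-the-envelope check of the throughput figures in the answer.
pixels = 3584 * 3584 * 75      # 963,379,200 pixels
size_mib = pixels / 2**20      # ~918 MiB at 1 byte per pixel
load_s = 0.180                 # reported "load" time

full_speed = size_mib / load_s   # MiB/s if the whole file were really read
per_frame = full_speed / 75      # MiB/s if only one frame is actually read

print(f"{size_mib:.1f} MiB / {load_s} s = {full_speed / 1024:.1f} GiB/s")
print(f"per-frame speed (1-byte pixels): {per_frame:.0f} MiB/s")
```

The ~5 GiB/s figure is the implausible one; the ~68 MiB/s per-frame figure (scaled up by 3-4x for 24/32-bit pixels) is well within ordinary disk speeds.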
