
Is there a way to get a numpy-style view to a slice of an array stored in a hdf5 file?

I have to work with large 3D cubes of data. I want to store them in HDF5 files (using h5py or maybe pytables). I often want to perform analysis on just a section of these cubes, but the section is too large to hold in memory. I would like to have a numpy-style view of my slice of interest without copying the data to memory (similar to what you can do with a numpy memmap). Is this possible? As far as I know, when you slice a dataset with h5py, you get a numpy array in memory.
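For reference, a minimal sketch of that behavior; the file cubes.h5 and dataset name cube are illustrative, and any 3D dataset shows the same thing:

```python
import h5py

with h5py.File("cubes.h5", "r") as f:
    dset = f["cube"]             # h5py Dataset object: nothing read from disk yet
    section = dset[0:100, :, :]  # slicing reads this whole section into memory
    print(type(section))         # <class 'numpy.ndarray'>: a copy, not a view
```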

It has been asked why I would want to do this, since the data has to enter memory at some point anyway. My code, out of necessity, already runs piecemeal over data from these cubes, pulling small pieces into memory at a time. These functions are simplest if they simply iterate over the entirety of the datasets passed to them. If I could have a view of the data on disk, I could simply pass this view to these functions unchanged. If I cannot have a view, I need to rewrite all my functions to iterate only over the slice of interest. This would add complexity to the code and make human error during analysis more likely.

Is there any way to get a view of the data on disk without copying it to memory?

One possibility is to create a generator that yields the elements of the slice one by one. Once you have such a generator, you can pass it to your existing code and iterate through it as normal. For example, you can use a for loop on a generator just as you would on a slice. Generators do not store all of their values at once; they 'generate' them as needed.
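A sketch of this idea, assuming an open h5py Dataset dset and a cuboid slice described by per-axis ranges (all names are illustrative):

```python
def iter_slice(dset, zs, ys, xs):
    """Yield the elements of dset[zs, ys, xs] one at a time.

    Only a single value is read from disk per iteration, so the
    full slice never has to fit in memory.
    """
    for z in zs:
        for y in ys:
            for x in xs:
                yield dset[z, y, x]

# Existing code that loops over an array can consume the generator unchanged:
# for value in iter_slice(dset, range(10, 20), range(0, 50), range(0, 50)):
#     process(value)
```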

You might be able to create a slice of just the locations of the cube you want, rather than the data itself, or you could generate the next location of your slice programmatically if you have too many locations to hold in memory as well. A generator could then use those locations to yield the data they contain one by one.

Assuming your slices are the (possibly higher-dimensional) equivalent of cuboids, you might generate the coordinates using nested for loops over range(), or by applying product() from the itertools module to range objects.
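A sketch of the itertools.product variant, under the same illustrative assumptions as above:

```python
from itertools import product

def iter_cuboid(dset, zs, ys, xs):
    """Yield ((z, y, x), value) pairs for a cuboid region of dset.

    product() generates the coordinates lazily, so the set of
    locations is never stored in memory either.
    """
    for z, y, x in product(zs, ys, xs):
        yield (z, y, x), dset[z, y, x]
```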

Copying that section of the dataset to memory is unavoidable, simply because you are requesting the entire section, not just a small part of it. Therefore, it must be copied completely.

So, since h5py already lets you use HDF5 datasets in much the same way as NumPy arrays, you will have to change your code to request only the values in the dataset that you currently need.
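For instance, a sketch of reading the dataset block by block instead of all at once; the file and dataset names and the block-wise analysis are illustrative stand-ins:

```python
import h5py

CHUNK = 64  # rows read per step; tune to the memory you have available
total = 0.0

with h5py.File("cubes.h5", "r") as f:
    dset = f["cube"]
    for start in range(0, dset.shape[0], CHUNK):
        block = dset[start:start + CHUNK]  # only this block enters memory
        total += block.sum()               # stand-in for the real analysis

print(total)
```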
