How can I persistently store and efficiently access a very large 2D list in Python?

In Python, I'm reading in a very large 2D grid of data that consists of around 200,000,000 data points in total. Each data point is a tuple of 3 floats. Reading all of this data into a two-dimensional list frequently causes MemoryErrors. To get around this, I would like to be able to read the data into some sort of table on the hard drive that can be efficiently accessed when given a grid coordinate, i.e. harddrive_table.get(300, 42).

So far in my research, I've come across PyTables, which is an implementation of HDF5 and seems like overkill, and the built-in shelve library, which uses a dictionary-like interface to access saved data. However, shelve keys have to be strings, and converting hundreds of millions of grid coordinates to strings for storage could be too much of a performance hit for my use case.

Are there any libraries that let me store a 2D table of data on the hard drive with efficient access to a single data point?

This table of data is only needed while the program is running, so I don't care about its interoperability or how it stores the data on the hard drive, as it will be deleted after the program has run.
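For scale, here is a rough back-of-the-envelope sketch (an added illustration, not part of the original question) of why the in-memory list fails: 200,000,000 points of 3 float64 values is already around 4.8 GB as a tightly packed array, and a Python list of tuples of float objects carries far more per-point overhead. Measuring that overhead directly on CPython:

    import sys

    # Rough per-point cost of the "list of tuples of floats" layout on CPython.
    point = (1.0, 2.0, 3.0)
    per_point = (
        sys.getsizeof(point)        # the tuple object itself
        + 3 * sys.getsizeof(1.0)    # the three float objects it references
        + 8                         # the pointer to the tuple held by the row list
    )

    n_points = 200_000_000
    print(f"~{per_point} bytes/point -> ~{per_point * n_points / 1e9:.0f} GB as nested lists")
    print(f"packed float64 array: ~{n_points * 3 * 8 / 1e9:.1f} GB")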

  1. HDF5 isn't really overkill if it works. In addition to PyTables there's the somewhat simpler h5py (see the h5py sketch after this list).

  2. NumPy lets you mmap a file directly into a NumPy array. The values will be stored in the disk file in a minimum-overhead way, with the array's shape providing the mapping between array indices and file offsets. mmap uses the same underlying OS mechanisms that power the disk cache to map a disk file into virtual memory, meaning the whole thing can be loaded into RAM if memory permits, but parts can be flushed to disk (and reloaded later on demand) if it doesn't all fit at once (see the memmap sketch below).
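A minimal h5py sketch of this access pattern, assuming a grid of known shape; the file name, dataset name, and grid dimensions are made up for illustration:

    import numpy as np
    import h5py

    ROWS, COLS = 20000, 10000          # example grid shape, ~200M points

    # Create an on-disk dataset of shape (rows, cols, 3); chunking keeps
    # single-point reads from dragging in the whole file.
    with h5py.File("grid.h5", "w") as f:
        table = f.create_dataset("grid", shape=(ROWS, COLS, 3),
                                 dtype="float64", chunks=(64, 64, 3))
        # write one row at a time instead of holding everything in RAM
        table[0, :, :] = np.zeros((COLS, 3))

    # Later: point lookups roughly like harddrive_table.get(300, 42)
    with h5py.File("grid.h5", "r") as f:
        point = f["grid"][300, 42]     # -> array of 3 floats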
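And a minimal numpy.memmap sketch of the same idea (file name and grid shape are again assumptions): the values live in a flat binary file, and the array shape maps a (row, col) index to a file offset.

    import numpy as np

    ROWS, COLS = 20000, 10000

    # Back a (ROWS, COLS, 3) float64 array with a disk file instead of RAM.
    # mode="w+" creates/overwrites the file, which will be ~4.8 GB.
    table = np.memmap("grid.dat", dtype="float64", mode="w+",
                      shape=(ROWS, COLS, 3))

    table[0, :, :] = 0.0               # writes go to the mapped file
    point = table[300, 42]             # reads pull in only the needed pages
    table.flush()                      # push dirty pages to disk when done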
