How can I mmap HDF5 data into multiple Python processes?

I am trying to load HDF5 data from a memory cache (memcached) or the network, and then query it (read only) from multiple Python processes, without making a separate copy of the whole data set. Intuitively I would like to mmap the image (as it would appear on disk) into the multiple processes, and then query it from Python.

I am finding this difficult to achieve, hence the question. Pointers/corrections appreciated.

Ideas I have explored so far

  • pytables - This looks the most promising: it supports a rich interface for querying HDF5 data and it (unlike numpy) seems to work with the data without making a (process-local) copy of it. It even supports a method File.get_file_image() which would seem to get the file image. What I don't see is how to construct a new File / FileNode from a memory image rather than a disk file (see the sketch just after this list).
  • h5py - Another way to get at HDF5 data; as with pytables, it seems to require a disk file. The option driver='core' looks promising, but I can't see how to provide an existing mmap'd region to it, rather than having it allocate its own.
  • numpy - A lower level approach: if I share my raw data via mmap, then I might be able to construct a numpy ndarray which can access this data. But the relevant constructor ndarray.__new__(buffer=...) says it will copy the data, and numpy views only seem to be constructible from existing ndarrays, not raw buffers.
  • ctypes - The lowest level approach (could possibly use multiprocessing's Value wrapper to help a little). If I use ctypes directly I can read my mmap'd data without issue, but I would lose all the structural information and the help from numpy/pandas/pytables in querying it.
  • Allocate disk space - I could just allocate a file, write out all the data, and then share it via pytables in all my processes. My understanding is that this would be memory efficient, because pytables doesn't copy (until required) and the processes would obviously share the OS disk cache of the underlying file image. My objection is that it is ugly and brings disk I/O into what I would like to be a pure in-memory system.
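
For the pytables route, here is a minimal sketch of constructing a File from a memory image rather than a disk file, assuming PyTables 3.x and its H5FD_CORE driver with the driver_core_image parameter; the sample.h5 read and the 'data' node are placeholders for whatever memcached or the network actually returns:

```python
import tables

# 'image' is the raw bytes of a complete HDF5 file, e.g. fetched from
# memcached or over the network; reading a local file here is only a
# stand-in for that.
with open("sample.h5", "rb") as fh:
    image = fh.read()

# H5FD_CORE with an explicit image and no backing store: PyTables/HDF5
# work entirely from the in-memory copy and never touch disk.
h5 = tables.open_file("in-memory", mode="r", driver="H5FD_CORE",
                      driver_core_image=image,
                      driver_core_backing_store=0)
print(h5.root.data[:])   # assumes the image contains an array node 'data'
h5.close()
```

Note that HDF5 keeps its own copy of the image inside each process that opens it this way, so this addresses the "File from a memory image" gap but not the "no per-process copy" goal.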

mmap + the core driver w/ H5py for in-memory read only access. I submitted a patch for H5py to work with file-images a while ago for scenarios like this. Unfortunately it got rejected, because upstream didn't want to give users the ability to shoot themselves in the foot and wanted safe buffer management (via the C buffer protocol Python 2.7 introduced), but this required changes on HDF's side which I haven't gotten around to. Still, if this is important to you and you are careful and capable of building pyHDF yourself, take a look at the patch/pull request here
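
For reference, a sketch of what the core-driver route looks like from the h5py side. The driver='core' form is standard h5py; the io.BytesIO form assumes a newer h5py (2.9 or later) that accepts Python file-like objects, a feature that arrived after this answer was written. Neither gives true zero-copy sharing between processes, but both keep the querying in memory; sample.h5 and the 'data' dataset are placeholders:

```python
import io
import h5py

# Option 1: core driver -- HDF5 reads the file once and then works
# entirely from an in-process memory image (read-only here).
with h5py.File("sample.h5", "r", driver="core") as f:
    data = f["data"][:]

# Option 2 (assumes h5py >= 2.9): open an HDF5 image that is already
# in memory, e.g. bytes fetched from memcached, via a file-like object.
image = open("sample.h5", "rb").read()      # stand-in for a memcached get
with h5py.File(io.BytesIO(image), "r") as f:
    data = f["data"][:]
```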

I think the situation should be updated now.

If a disk file is desirable, Numpy now has a standard, dedicated ndarray subclass: numpy.memmap
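
A minimal sketch of the numpy.memmap route, assuming the raw array has been written to a file once; every process that maps it read-only shares the same physical pages through the OS page cache, so no per-process copy is made (the filename and dtype below are placeholders):

```python
import numpy as np

# One-time setup, done by a single writer process: dump the raw data.
a = np.arange(1_000_000, dtype=np.float64)
a.tofile("shared_data.bin")

# In each reader process: map the file read-only.  dtype and shape are
# not stored in a raw file, so they must be agreed on out of band.
m = np.memmap("shared_data.bin", dtype=np.float64, mode="r",
              shape=(1_000_000,))
print(m[:5], m.sum())
```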

UPDATE: After looking into the implementation of multiprocessing.sharedctypes (CPython 3.6.2 shared memory block allocation code), I found that it always creates tmp files to be mmap'ed, so it is not really a file-less solution.

If only pure RAM-based sharing is expected, someone has demoed it with multiprocessing.RawArray: test of shared memory array / numpy integration
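
A minimal sketch of that pure-RAM approach, assuming the data fits in a flat numeric array: multiprocessing.RawArray allocates one shared, unsynchronized block, and np.frombuffer wraps it in an ndarray view in each process without copying (the sizes and dtype here are illustrative):

```python
import multiprocessing as mp
import numpy as np

def reader(raw, n):
    # Wrap the shared block in an ndarray view -- no copy is made.
    view = np.frombuffer(raw, dtype=np.float64, count=n)
    print("child sees:", view[:5], view.sum())

if __name__ == "__main__":
    n = 1_000_000
    raw = mp.RawArray("d", n)                      # shared, lock-free block
    arr = np.frombuffer(raw, dtype=np.float64, count=n)
    arr[:] = np.arange(n, dtype=np.float64)        # fill it once in the parent

    p = mp.Process(target=reader, args=(raw, n))
    p.start()
    p.join()
```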

