How can I mmap HDF5 data into multiple Python processes?
I am trying to load HDF5 data from a memory cache (memcached) or the network, and then query it (read only) from multiple Python processes, without making a separate copy of the whole data set. Intuitively I would like to mmap the image (as it would appear on disk) into the multiple processes, and then query it from Python.

I am finding this difficult to achieve, hence the question. Pointers/corrections appreciated.
Pytables has a File.get_file_image() method, which would seem to get the file image. What I don't see is how to construct a new File / FileNode from a memory image rather than a disk file. On the numpy side, ndarray.__new__(buffer=...) says it will copy the data, and numpy views can only seem to be constructed from existing ndarrays, not raw buffers. At the lowest level I could use ctypes (possibly with multiprocessing's Value wrapper to help a little): if I use ctypes directly I can read my mmap'd data without issue, but I would lose all the structural information and help from numpy/pandas/pytables to query it.
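As an aside on the numpy point: numpy.frombuffer does build a zero-copy view over a raw buffer such as an mmap, without going through ndarray.__new__. A minimal sketch, with a made-up file name and dtype:

```python
import mmap
import numpy as np

# Write a small binary file so the mapping below has something to read
# (a stand-in for whatever produced the shared data).
np.arange(8, dtype=np.float64).tofile("data.bin")

with open("data.bin", "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # numpy.frombuffer wraps the existing mapping without copying.
    view = np.frombuffer(buf, dtype=np.float64)
    print(view[:4])
```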
mmap + the core driver w/ H5py for in-memory read only access. I submitted a patch for H5py to work with file images a while ago for scenarios like this. Unfortunately it got rejected, because upstream didn't want to give users the ability to shoot themselves in the foot and wanted safe buffer management (via the C buffer protocol Python 2.7 introduced), but that required changes on HDF's side which I haven't gotten around to. Still, if this is important to you and you are careful and capable of building pyHDF yourself, take a look at the patch/pull request here.
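For reference: on newer h5py (2.9 and later), h5py.File also accepts Python file-like objects, which covers this read-only, in-memory scenario without patching (though it goes through a file-like object rather than a true mmap). A minimal sketch, with a made-up dataset name:

```python
import io
import h5py
import numpy as np

# Build an HDF5 image in memory (a stand-in for bytes fetched from
# memcached or produced by pytables' File.get_file_image()).
buf = io.BytesIO()
with h5py.File(buf, "w") as f:
    f["some_dataset"] = np.arange(10)
image = buf.getvalue()

# Reopen the image read-only, without a filesystem path.
with h5py.File(io.BytesIO(image), "r") as f:
    print(f["some_dataset"][:])
```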
I think the situation should be updated now.

If a disk file is desirable, Numpy now has a standard, dedicated ndarray subclass: numpy.memmap
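A minimal sketch of that approach, assuming the dtype and shape are known out of band (file name and values here are made up):

```python
import numpy as np

# Writer: persist the array once as a raw binary file.
a = np.arange(12, dtype=np.float64).reshape(3, 4)
a.tofile("shared.dat")

# Readers (any number of processes): map the same file read-only.
# The OS page cache backs all mappings, so no per-process copy is made.
view = np.memmap("shared.dat", dtype=np.float64, mode="r", shape=(3, 4))
print(view[1, 2])
```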
UPDATE: After looking into the implementation of multiprocessing.sharedctypes (CPython 3.6.2 shared memory block allocation code), I found that it always creates tmp files to be mmap'ed, so it is not really a file-less solution.
If only pure RAM based sharing is expected, someone has demoed it with multiprocessing.RawArray: test of shared memory array / numpy integration
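A minimal sketch of that idea, assuming a fixed-size float64 array whose shape both sides agree on (names are illustrative):

```python
from multiprocessing import Process
from multiprocessing.sharedctypes import RawArray

import numpy as np

def worker(raw, shape):
    # Wrap the shared block in a numpy view; no data is copied.
    arr = np.frombuffer(raw, dtype=np.float64).reshape(shape)
    print("child sees:", arr[0, :3])

if __name__ == "__main__":
    shape = (4, 5)
    # RawArray allocates one shared memory block, visible to both
    # the parent and the child process.
    raw = RawArray("d", shape[0] * shape[1])
    arr = np.frombuffer(raw, dtype=np.float64).reshape(shape)
    arr[:] = np.arange(arr.size, dtype=np.float64).reshape(shape)
    p = Process(target=worker, args=(raw, shape))
    p.start()
    p.join()
```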