Can I write to a HDF5 file from multiple processes/threads?

Does HDF5 support parallel writes to the same file, from different threads or from different processes? Alternatively, does HDF5 support non-blocking writes?

If so, is this also supported by NetCDF4, and by the Python bindings for either?

I am writing an application where I want different CPU cores to concurrently compute output intended for non-overlapping tiles of a very large output array. (Later I will want to read sections from it as a single array, without needing my own driver to manage indexing many separate files, and ideally without the additional IO task of rearranging it on disk.)

Not trivially, but there are various potential workarounds.

The ordinary HDF5 library apparently does not even support concurrent reading of different files by multiple threads. Consequently NetCDF4, and the Python bindings for either, will not support parallel writing.

If the output file is pre-initialised and has chunking and compression disabled, to avoid having a chunk index, then (in principle) concurrent non-overlapping writes to the same file by separate processes might work(?).
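As an unverified sketch of that approach with h5py and multiprocessing (the file name, dataset name, and shape here are made up for illustration):

```python
import multiprocessing as mp

import h5py
import numpy as np

PATH, SHAPE = "output.h5", (4, 1000)  # hypothetical file name and array shape

def write_tile(i):
    # Each worker opens the pre-initialised file independently and writes
    # only its own (non-overlapping) row. Nothing here is guaranteed safe
    # by the HDF5 library; with HDF5 >= 1.10 you may also need to set
    # HDF5_USE_FILE_LOCKING=FALSE to allow multiple concurrent opens.
    with h5py.File(PATH, "r+") as f:
        f["data"][i, :] = np.full(SHAPE[1], float(i))

if __name__ == "__main__":
    # Pre-create the full dataset with a contiguous layout: no chunks,
    # no compression, so there is no chunk index to update later.
    with h5py.File(PATH, "w") as f:
        f.create_dataset("data", SHAPE, dtype="f8")
    with mp.Pool(SHAPE[0]) as pool:
        pool.map(write_tile, range(SHAPE[0]))
```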

In more recent versions of HDF5 (1.10 onward), there should be support for virtual datasets. Each process would write its output to a different file, and afterward a new container file would be created, consisting of references to the individual data files (but otherwise able to be read like a normal HDF5 file).
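With h5py, which exposes this as VirtualLayout/VirtualSource, assembling such a container file might look like the following; the tile file names and shapes are placeholders:

```python
import h5py
import numpy as np

# Suppose each of 4 workers wrote its tile to tile_0.h5 .. tile_3.h5,
# each containing a 1D dataset "data" of length 1000.
layout = h5py.VirtualLayout(shape=(4, 1000), dtype="f8")
for i in range(4):
    layout[i] = h5py.VirtualSource(f"tile_{i}.h5", "data", shape=(1000,))

# The container file stores only references to the tile files, but it
# reads back like a normal (4, 1000) HDF5 dataset.
with h5py.File("container.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("data", layout, fillvalue=np.nan)
```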

There exists a "Parallel HDF5" build of the library for MPI. Although MPI might otherwise seem like overkill, it would have advantages if scaling up later to multiple machines.
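Assuming h5py compiled against an MPI-enabled HDF5 (plus mpi4py), a minimal collective-write sketch would be something like this, run under e.g. `mpiexec -n 4`; the file and dataset names are placeholders:

```python
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Requires h5py built against a parallel (MPI-enabled) HDF5.
with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("data", (comm.Get_size(), 1000), dtype="f8")
    # Each rank writes its own non-overlapping row of the shared file.
    dset[rank, :] = rank
```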

If writing output is not a performance bottleneck, a multithreaded application could probably implement one dedicated output thread (utilising some form of queue data structure).
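A minimal sketch of that single-writer pattern using Python's threading and queue modules (file name, dataset name, and shapes are placeholders):

```python
import queue
import threading

import h5py
import numpy as np

def writer(q, path, shape):
    # This thread is the sole owner of the file; compute threads never
    # touch HDF5 directly, they only enqueue (index, array) pairs.
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("data", shape, dtype="f8")
        while True:
            item = q.get()
            if item is None:  # sentinel: all tiles done
                break
            rows, tile = item
            dset[rows] = tile

q = queue.Queue()
t = threading.Thread(target=writer, args=(q, "output.h5", (4, 1000)))
t.start()

# Compute threads (here just a loop) enqueue finished tiles:
for i in range(4):
    q.put((np.s_[i, :], np.full(1000, float(i))))
q.put(None)
t.join()
```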

[Edit:] Another option is to use the zarr format instead, which places each chunk in a separate file (an approach which future versions of HDF currently seem likely to adopt).
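A minimal sketch with the zarr-python package (2.x API; the store path, shape, and chunking are placeholders). Because each chunk lives in its own file, separate processes can write concurrently as long as each writes only whole chunks it owns:

```python
import zarr
import numpy as np

# Each (1, 1000) chunk is stored as a separate file under output.zarr/.
z = zarr.open("output.zarr", mode="w", shape=(4, 1000),
              chunks=(1, 1000), dtype="f8")
z[0, :] = np.ones(1000)  # e.g. worker 0 writes the chunk it owns
```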

If you are running in AWS, check out HDF Cloud: https://www.hdfgroup.org/solutions/hdf-cloud .

This is a service that enables multiple-reader/multiple-writer workflows and is largely feature-compatible with the HDF5 library. The client SDK doesn't support non-blocking writes, but of course if you are using the REST API directly you could do non-blocking I/O just as you would with any HTTP-based service.
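For illustration only, this is what non-blocking writes against an HTTP service look like with aiohttp; the endpoint and payload below are placeholders, not the actual HDF REST API:

```python
import asyncio

import aiohttp

async def put_tile(session, i):
    # Hypothetical endpoint and JSON payload, purely illustrative.
    url = f"https://example.com/datasets/mydata/value?tile={i}"
    async with session.put(url, json={"value": [i] * 1000}) as resp:
        return resp.status

async def main():
    async with aiohttp.ClientSession() as session:
        # All tile writes are issued concurrently; none blocks the others.
        statuses = await asyncio.gather(*(put_tile(session, i) for i in range(4)))
        print(statuses)

asyncio.run(main())
```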
