Can I write to a HDF5 file from multiple processes/threads?

Does HDF5 support parallel writes to the same file, from different threads or from different processes? Alternatively, does HDF5 support non-blocking writes?

If so, is this also supported by NetCDF4, and by the Python bindings for either?

I am writing an application where I want different CPU cores to concurrently compute output intended for non-overlapping tiles of a very large output array. (Later I will want to read sections from it as a single array, without needing my own driver to manage indexing many separate files, and ideally without the additional IO task of rearranging it on disk.)

Not trivially, but there are various potential workarounds.

The ordinary HDF5 library apparently does not even support concurrent reading of different files by multiple threads. Consequently NetCDF4, and the Python bindings for either, will not support parallel writing.

If the output file is pre-initialised and has chunking and compression disabled, to avoid having a chunk index, then (in principle) concurrent non-overlapping writes to the same file by separate processes might work(?).
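As an unverified sketch of that approach with h5py and multiprocessing (the file name, dataset name, and shape here are made up for illustration):

```python
import multiprocessing as mp

import h5py
import numpy as np

PATH, SHAPE = "output.h5", (4, 1000)  # hypothetical file name and array shape

def write_tile(i):
    # Each worker opens the pre-initialised file independently and writes
    # only its own (non-overlapping) row. Nothing here is guaranteed safe
    # by the HDF5 library; with HDF5 >= 1.10 you may also need to set
    # HDF5_USE_FILE_LOCKING=FALSE to allow multiple concurrent opens.
    with h5py.File(PATH, "r+") as f:
        f["data"][i, :] = np.full(SHAPE[1], float(i))

if __name__ == "__main__":
    # Pre-create the full dataset with a contiguous layout: no chunks,
    # no compression, so there is no chunk index to update later.
    with h5py.File(PATH, "w") as f:
        f.create_dataset("data", SHAPE, dtype="f8")
    with mp.Pool(SHAPE[0]) as pool:
        pool.map(write_tile, range(SHAPE[0]))
```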

In more recent versions of HDF5 (1.10 onward), there should be support for virtual datasets. Each process would write its output to a different file, and afterward a new container file would be created, consisting of references to the individual data files (but otherwise able to be read like a normal HDF5 file).
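With h5py, which exposes this as VirtualLayout/VirtualSource, assembling such a container file might look like the following; the tile file names and shapes are placeholders:

```python
import h5py
import numpy as np

# Suppose each of 4 workers wrote its tile to tile_0.h5 .. tile_3.h5,
# each containing a 1D dataset "data" of length 1000.
layout = h5py.VirtualLayout(shape=(4, 1000), dtype="f8")
for i in range(4):
    layout[i] = h5py.VirtualSource(f"tile_{i}.h5", "data", shape=(1000,))

# The container file stores only references to the tile files, but it
# reads back like a normal (4, 1000) HDF5 dataset.
with h5py.File("container.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("data", layout, fillvalue=np.nan)
```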

There exists a "Parallel HDF5" build of the library for MPI. Although MPI might otherwise seem like overkill, it would have advantages if scaling up later to multiple machines.
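Assuming h5py compiled against an MPI-enabled HDF5 (plus mpi4py), a minimal collective-write sketch would be something like this, run under e.g. `mpiexec -n 4`; the file and dataset names are placeholders:

```python
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Requires h5py built against a parallel (MPI-enabled) HDF5.
with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("data", (comm.Get_size(), 1000), dtype="f8")
    # Each rank writes its own non-overlapping row of the shared file.
    dset[rank, :] = rank
```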

If writing output is not a performance bottleneck, a multithreaded application could probably implement one dedicated output thread (utilising some form of queue data structure).
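A minimal sketch of that single-writer pattern using Python's threading and queue modules (file name, dataset name, and shapes are placeholders):

```python
import queue
import threading

import h5py
import numpy as np

def writer(q, path, shape):
    # This thread is the sole owner of the file; compute threads never
    # touch HDF5 directly, they only enqueue (index, array) pairs.
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("data", shape, dtype="f8")
        while True:
            item = q.get()
            if item is None:  # sentinel: all tiles done
                break
            rows, tile = item
            dset[rows] = tile

q = queue.Queue()
t = threading.Thread(target=writer, args=(q, "output.h5", (4, 1000)))
t.start()

# Compute threads (here just a loop) enqueue finished tiles:
for i in range(4):
    q.put((np.s_[i, :], np.full(1000, float(i))))
q.put(None)
t.join()
```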

[Edit:] Another option is to use the zarr format instead, which places each chunk in a separate file (an approach which future versions of HDF currently seem likely to adopt).
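A minimal sketch with the zarr-python package (2.x API; the store path, shape, and chunking are placeholders). Because each chunk lives in its own file, separate processes can write concurrently as long as each writes only whole chunks it owns:

```python
import zarr
import numpy as np

# Each (1, 1000) chunk is stored as a separate file under output.zarr/.
z = zarr.open("output.zarr", mode="w", shape=(4, 1000),
              chunks=(1, 1000), dtype="f8")
z[0, :] = np.ones(1000)  # e.g. worker 0 writes the chunk it owns
```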

If you are running in AWS, check out HDF Cloud: https://www.hdfgroup.org/solutions/hdf-cloud .

This is a service that enables multiple-reader/multiple-writer workflows and is largely feature-compatible with the HDF5 library. The client SDK doesn't support non-blocking writes, but of course if you are using the REST API directly you could do non-blocking I/O just as you would with any HTTP-based service.
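For illustration only, this is what non-blocking writes against an HTTP service look like with aiohttp; the endpoint and payload below are placeholders, not the actual HDF REST API:

```python
import asyncio

import aiohttp

async def put_tile(session, i):
    # Hypothetical endpoint and JSON payload, purely illustrative.
    url = f"https://example.com/datasets/mydata/value?tile={i}"
    async with session.put(url, json={"value": [i] * 1000}) as resp:
        return resp.status

async def main():
    async with aiohttp.ClientSession() as session:
        # All tile writes are issued concurrently; none blocks the others.
        statuses = await asyncio.gather(*(put_tile(session, i) for i in range(4)))
        print(statuses)

asyncio.run(main())
```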
