
How do I add values in parallel to an existing HDF5 file with 3 groups and 12 datasets in each group using h5py?

I have installed the libraries using this link. I have already created an HDF5 file called test.h5 using mpiexec -n 1 python3 test.py. test.py is shown below; I'm not sure whether mpi4py is actually necessary here, so please let me know.

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

f = h5py.File('test.h5', 'w', driver='mpio', comm=comm)

f.create_group('t1')
f.create_group('t2')
f.create_group('t3')

for i in range(12):
    f['t1'].create_dataset('test{0}'.format(i), (1,), dtype='f', compression='gzip')
    f['t2'].create_dataset('test{0}'.format(i), (1,), dtype='i', compression='gzip')
    f['t3'].create_dataset('test{0}'.format(i), (1,), dtype='i', compression='gzip')

f.close()

Now, I would like to write a test1.py file that will:

  1. Open test.h5 and get all the unique keys (they are the same for all three groups).
  2. Make chunks of those keys, like chunks = [['test0','test1','test2'],['test3','test4','test5'],['test6','test7','test8'],['test9','test10','test11']]. I don't care about the order or groupings of these chunks, but I would like one chunk per process.
  3. For each chunk, assign a process to store a value for every key in that chunk, in every group. In other words, I would like to run this function in parallel:
def write_h5(f, rank, chunks):
    for key in chunks[rank]:
        f['t1'][key][:] += 0.5
        f['t2'][key][:] += 1
        f['t3'][key][:] += 1

How do I do this? Can you please explain in detail? Thanks a lot in advance!

test1.py should contain:

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def chunk_seq(seq, num):
    # Split seq into num roughly equal chunks, one per MPI rank
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out

def write_h5(f, chunk):
    # Each rank updates only the datasets whose keys fall in its chunk
    for key in chunk:
        f['t1'][key][:] += 0.5
        f['t2'][key][:] += 1
        f['t3'][key][:] += 1

# Every rank opens the same file collectively with the mpio driver
f = h5py.File('test.h5', 'a', driver='mpio', comm=comm)
chunks = chunk_seq(list(f['t1'].keys()), size)

write_h5(f, chunks[rank])

f.close()

Run it using: mpiexec -n 4 python3 test1.py. The problem is that this will only work if you don't set compression='gzip' when creating the datasets. For reference, see the question Does HDF5 support compression with parallel HDF5? If not, why?, though I'm not sure whether that still holds for the latest version. Looking at this, it seems you'll have to read each dataset serially and create a corresponding dataset in a new HDF5 file with compression.
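A minimal sketch of that serial repacking step might look like the following (the output file name test_repacked.h5 is just an illustration, not part of the original setup). It is run with a single process, without the mpio driver:

import h5py

# Read each dataset serially from the parallel-written file and rewrite it
# into a new file with gzip compression applied.
with h5py.File('test.h5', 'r') as src, \
     h5py.File('test_repacked.h5', 'w') as dst:
    for group_name in src:
        grp = dst.create_group(group_name)
        for key in src[group_name]:
            grp.create_dataset(key,
                               data=src[group_name][key][:],
                               compression='gzip')

Run it with plain python3 (no mpiexec needed), since the copy is purely serial.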
