
How do I safely write data from a single HDF5 file to multiple files in parallel in Python?

I am trying to write my data (from a single file in HDF5 format) to multiple files, and it works fine when the task is executed serially. Now I want to improve efficiency by modifying the code to use the multiprocessing module, but the output sometimes comes out wrong. Here is a simplified version of my code:

import multiprocessing as mp
import numpy as np
import math, h5py, time
N = 4  # number of processes to use
block_size = 300
data_sz = 678
dataFile = 'mydata.h5'

# fake some data
mydata = np.zeros((data_sz, 1))
for i in range(data_sz):
    mydata[i, 0] = i+1
h5file = h5py.File(dataFile, 'w')
h5file.create_dataset('train', data=mydata)

# fire multiple workers
pool = mp.Pool(processes=N)
total_part = int(math.ceil(1. * data_sz / block_size))
for i in range(total_part):
    pool.apply_async(data_write_func, args=(dataFile, i, ))
pool.close()
pool.join()

and the structure of data_write_func() is:

def data_write_func(h5file_dir, i, block_size=block_size):
    hf = h5py.File(h5file_dir, 'r')  # open the shared HDF5 file read-only
    fout = open('data_part_' + str(i), 'w')
    data_part = hf['train'][block_size*i : min(block_size*(i+1), data_sz)]  # np.ndarray
    for line in data_part:
        # do some processing, that takes a while...
        time.sleep(0.01)
        # then write out..
        fout.write(str(line[0]) + '\n')
    fout.close()

When I set N=1, it works well. But when I set N=2 or N=4, the result sometimes gets messed up (not every time!). For example, in data_part_1 I expect the output to be:

301,
302,
303,
...

But sometimes what I get is

0,
0,
0,
...

and sometimes I get

379,
380,
381,
...

I'm new to the multiprocessing module and find it tricky. I'd appreciate any suggestions!

After fixing the fout.write and mydata=... as Andriy suggested, your program works as intended, because every process writes to its own file. There is no way for the processes to interfere with each other.

What you probably wanted is multiprocessing.map(), which splits the iterable for you (so you don't need the block_size logic) and guarantees that the results are processed in order. I've reworked your code to use multiprocessing map:

import multiprocessing

def data_write_func(line):
    # _identity[0] is the worker's 1-based index within the pool
    i = multiprocessing.current_process()._identity[0]
    line = [x*2 for x in line]
    files[i-1].write(",".join(str(s) for s in line) + "\n")

N = 4
mydata=[[x+1,x+2,x+3,x+4] for x in range(0,4000*N,4)] # fake some data
files = [open('data_part_'+str(i), 'w') for i in range(N)]

pool = multiprocessing.Pool(processes=N)
pool.map(data_write_func, mydata)
pool.close()
pool.join()

Please note:

  • i is taken from the worker process itself; with N=4 workers it ranges from 1 to 4
  • as data_write_func is now called for every row, the files need to be opened in the parent process. Also, you don't need to close() the files manually; the OS will do that for you when your Python program exits.

Now, I guess in the end you'd want to have all the output in one file, not in separate files. If your output line is below 4096 bytes on Linux (or below 512 bytes on OS X; for other OSes see here), you are actually safe to just open one file (in append mode) and let every process write into that one file, as writes below these sizes are guaranteed by Unix to be atomic.
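A minimal sketch of that single-file variant, assuming short output lines and a POSIX system (the file name output_all.txt and the doubling step are placeholders for illustration):

import multiprocessing

def data_write_func(line):
    # Each worker appends to the shared file; one short write per line,
    # so the lines are expected to land without interleaving.
    with open('output_all.txt', 'a') as fout:
        fout.write(",".join(str(x * 2) for x in line) + "\n")

N = 4
mydata = [[x + 1, x + 2, x + 3, x + 4] for x in range(0, 4000 * N, 4)]

pool = multiprocessing.Pool(processes=N)
pool.map(data_write_func, mydata)
pool.close()
pool.join()

Note that with this approach the lines arrive in whatever order the workers finish, so if the original row order matters you would need to tag each line (for example with its row index) and sort afterwards.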

Update:

"What if the data is stored in hdf5 file as dataset?" “如果数据作为数据集存储在hdf5文件中怎么办?”

According to the HDF5 docs this works out of the box since version 2.2.0:

Parallel HDF5 is a configuration of the HDF5 library which lets you share open files across multiple parallel processes. It uses the MPI (Message Passing Interface) standard for interprocess communication.

So if you do this in your code:

h5file = h5py.File(dataFile, 'w')
dset = h5file.create_dataset('train', data=mydata)

Then you can just access dset from within your processes and read/write to it without taking any extra measures. See also this example from h5py using multiprocessing.
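For illustration, here is a minimal sketch of what the MPI-backed mode can look like in h5py, assuming h5py was built against a parallel HDF5 library and mpi4py is installed (parallel_demo.h5 is just a placeholder name):

# Run with something like: mpiexec -n 4 python parallel_write.py
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Every rank opens the same file collectively through the 'mpio' driver.
with h5py.File('parallel_demo.h5', 'w', driver='mpio', comm=comm) as f:
    dset = f.create_dataset('train', (comm.size, 10), dtype='f8')
    # Each process writes its own row of the shared dataset.
    dset[rank] = np.full(10, rank, dtype='f8')

Note that create_dataset is a collective operation (every rank has to call it), while the per-row writes can be done independently.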

The issue could not be replicated. Here is my full code:

#!/usr/bin/env python

import multiprocessing

N = 4
mydata=[[x+1,x+2,x+3,x+4] for x in range(0,4000*N,4)] # fake some data

def data_write_func(mydata, i, block_size=1000):
    fout = open('data_part_'+str(i), 'w')
    data_part = mydata[block_size*i: block_size*i+block_size]
    for line in data_part:
        # do some processing, say *2 for each element...
        line = [x*2 for x in line]
        # then write out..
        fout.write(','.join(map(str,line))+'\n')
    fout.close()

pool = multiprocessing.Pool(processes=N)
for i in range(2):
    pool.apply_async(data_write_func, (mydata, i, ))
pool.close()
pool.join()

Sample output from data_part_0:

2,4,6,8
10,12,14,16
18,20,22,24
26,28,30,32
34,36,38,40
42,44,46,48
50,52,54,56
58,60,62,64

multiprocessing cannot guarantee the order of code execution between different processes; it is perfectly reasonable for 2 processes to execute in the reverse of their creation order (at least on Windows and mainstream Linux).

Usually when you use parallelization, you need worker processes to generate the data, then aggregate the data into a thread-safe structure and save that to file. But you are writing to one file here, presumably onto one hard disk; do you have any reason to believe you will get any additional performance by using multiple processes?
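A minimal sketch of that aggregation pattern, assuming the per-row processing is the expensive part (the doubling step and the output.txt name are placeholders): the workers only compute, and the parent process is the single writer.

import multiprocessing as mp

def process_row(row):
    # CPU-bound work happens in the worker; no file handles are shared.
    return ','.join(str(x * 2) for x in row)

if __name__ == '__main__':
    rows = [[x + 1, x + 2, x + 3, x + 4] for x in range(0, 4000, 4)]
    pool = mp.Pool(processes=4)
    with open('output.txt', 'w') as fout:
        # imap yields results in input order, so only the parent touches the file.
        for line in pool.imap(process_row, rows, chunksize=64):
            fout.write(line + '\n')
    pool.close()
    pool.join()

Because only the parent writes, no locking or atomic-append assumptions are needed, and the output order matches the input order.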
