How to write to an HDF5 file using multiprocessing?

So in my code I have something like this:

import tables
import bson

def proc():
    data = bson.decode_file_iter(open('file.bson', 'rb'))
    atom = tables.Float64Atom()
    f = tables.open_file('t.hdf5', mode='w')
    array_c = f.create_earray(f.root, 'data', atom, (0, m))

    for c,d in enumerate(data):
        for e,p in enumerate(d['id']):
            x = some_array1bym()
            array_c.append(x)

    f.close()

This works fine, but I want to rewrite it using multiprocessing. Since I am new to this, I don't know exactly how to do it. I found something like this:

import multiprocessing as mp

def proc():
    NCORE = 6
    data = bson.decode_file_iter(open('file.bson', 'rb'))
    atom = tables.Float64Atom()
    f = tables.open_file('t.hdf5', mode='w')
    array_c = f.create_earray(f.root, 'data', atom, (0, m))

    def process(q, iolock):
        while True:
            d = q.get()
            if d is None:
                break
            for e, p in enumerate(d['id']):
                x = some_array1bym()
                array_c.append(x)

    q = mp.Queue(maxsize=NCORE)
    iolock = mp.Lock()
    pool = mp.Pool(NCORE, initializer=process, initargs=(q, iolock))

    for c, d in enumerate(data):
        q.put(d)

    for _ in range(NCORE):
        q.put(None)
    pool.close()
    pool.join()

    f.close()

This however gives me an empty file.

Can anybody help?

Thanks!

You've misunderstood the use of multiprocessing.Pool slightly. When you initialize a Pool, it starts up N worker processes. The initializer argument is just a function that is run once on each worker process when it starts up; it's not a task that the processes will be performing later. You then use methods like Pool.map or Pool.apply (or their async counterparts) to actually submit jobs to the pool for processing.
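
For reference, the usual workflow looks something like this minimal sketch (the square function here is just a placeholder job, unrelated to your data):

import multiprocessing as mp

def square(x):
    # runs in a worker process, once per submitted job
    return x * x

if __name__ == '__main__':
    with mp.Pool(4) as pool:                       # starts 4 worker processes
        results = pool.map(square, range(10))      # submit jobs and collect results in order
    print(results)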

I think the problem may have to do with the array_c variable. After the Pool forks, each worker gets a copy of this variable. This will result in either a) every one of these copies of array_c trying to write to the hdf5 file, giving undefined results, or b) only the copy in the main process, which is empty, being written to the file when f.close() is called. I'm not sure which, as I am not familiar with the internals of pytables.

In contrast to array_c, q and iolock are shared between all workers and the main process. q is an mp.Queue instance and iolock is an mp.Lock instance, and these classes are specifically designed to be used by multiple processes at once. I don't think the same is true of the pytables classes.

You should be using the mp.Lock instance to ensure that only one process writes to the file at once. I think what you will have to do is modify your process function to something like the following:

def process(q, iolock):
    while True:
        d = q.get()
        if d is None:
            break
        for e, p in enumerate(d['id']):
            x = some_array1bym()
            # acquire lock to ensure only one process writes data at once
            iolock.acquire()
            # get new handle to hdf5 file in append mode
            f = tables.open_file('t.hdf5', mode='a')
            # get new handle to data set and append new data
            array_c = f.root.data
            array_c.append(x)
            # close file and release lock to allow another process to write
            f.close()
            iolock.release()
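
Putting the pieces together, the whole thing could be wired up roughly like this (an untested sketch: m and some_array1bym() are your placeholders, and the queue / sentinel setup is kept from your original attempt):

import multiprocessing as mp
import tables
import bson

NCORE = 6

def process(q, iolock):
    # worker loop: consume documents from the queue until a None sentinel arrives
    while True:
        d = q.get()
        if d is None:
            break
        for e, p in enumerate(d['id']):
            x = some_array1bym()      # placeholder for your per-row computation
            with iolock:
                # only one process touches the file at a time
                f = tables.open_file('t.hdf5', mode='a')
                f.root.data.append(x)
                f.close()

def proc():
    # create the file and the extendable array once, in the main process, then close it
    atom = tables.Float64Atom()
    f = tables.open_file('t.hdf5', mode='w')
    f.create_earray(f.root, 'data', atom, (0, m))    # m is your row width
    f.close()

    q = mp.Queue(maxsize=NCORE)
    iolock = mp.Lock()
    pool = mp.Pool(NCORE, initializer=process, initargs=(q, iolock))

    data = bson.decode_file_iter(open('file.bson', 'rb'))
    for c, d in enumerate(data):
        q.put(d)

    # one sentinel per worker so every process() loop terminates
    for _ in range(NCORE):
        q.put(None)
    pool.close()
    pool.join()

if __name__ == '__main__':
    proc()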
