So in my code I have something like this:
import tables
import bson

def proc():
    data = bson.decode_file_iter(open('file.bson', 'rb'))
    atom = tables.Float64Atom()
    f = tables.open_file('t.hdf5', mode='w')
    array_c = f.create_earray(f.root, 'data', atom, (0, m))
    for c, d in enumerate(data):
        for e, p in enumerate(d['id']):
            x = some_array1bym()
            array_c.append(x)
    f.close()
This works fine, but I want to rewrite it with multiprocessing. Since I am new to this, I don't know exactly how to do it. I found something like this:
import multiprocessing as mp

def proc():
    NCORE = 6
    data = bson.decode_file_iter(open('file.bson', 'rb'))
    atom = tables.Float64Atom()
    f = tables.open_file('t.hdf5', mode='w')
    array_c = f.create_earray(f.root, 'data', atom, (0, m))

    def process(q, iolock):
        while True:
            d = q.get()
            if d is None:
                break
            for e, p in enumerate(d['id']):
                x = some_array1bym()
                array_c.append(x)

    q = mp.Queue(maxsize=NCORE)
    iolock = mp.Lock()
    pool = mp.Pool(NCORE, initializer=process, initargs=(q, iolock))
    for c, d in enumerate(data):
        q.put(d)
    for _ in range(NCORE):
        q.put(None)
    pool.close()
    pool.join()
    f.close()
This, however, gives me an empty file.
Can anybody help?
Thanks!
You've misunderstood the use of multiprocessing.Pool slightly. When you initialize a Pool, it starts up N worker processes. The initializer argument is just a function that is run once on each worker process when it starts up; it's not a task that the processes will be performing later. You then use methods like Pool.map or Pool.apply (or their async counterparts) to actually submit jobs to the pool for processing.
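To make the distinction concrete, here is a minimal sketch (the names init_worker and square are illustrative, not from your code): the initializer runs once per worker when the pool starts, while the real work is submitted afterwards via Pool.map.

```python
import multiprocessing as mp

def init_worker(tag):
    # runs exactly once in each worker process, at pool start-up
    global worker_tag
    worker_tag = tag

def square(x):
    # an actual task, dispatched to the workers via Pool.map
    return x * x

if __name__ == '__main__':
    with mp.Pool(2, initializer=init_worker, initargs=('demo',)) as pool:
        results = pool.map(square, range(5))
    print(results)  # [0, 1, 4, 9, 16]
```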
I think the problem may have to do with the array_c variable. After the Pool forks, each worker will get a copy of this variable. This will result in either (a) every one of these copies of array_c trying to write to the hdf5 file, giving undefined results, or (b) only the copy in the main process, which is empty, writing to the file when f.close() is called. I'm not sure which, as I am not familiar with the internals of pytables.
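You can see the copy-on-fork behaviour with a small sketch, where an ordinary dict (a stand-in for array_c) is mutated by the workers but the parent's copy never changes:

```python
import multiprocessing as mp

counter = {'n': 0}  # ordinary object: each worker process gets its own copy

def bump(_):
    # mutates the worker's private copy only
    counter['n'] += 1
    return counter['n']

if __name__ == '__main__':
    with mp.Pool(2) as pool:
        pool.map(bump, range(4))
    print(counter['n'])  # still 0 in the parent process
```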
In contrast to array_c, q and iolock are shared between all workers and the main process. q is an mp.Queue instance and iolock is an mp.Lock instance, and these classes are specifically designed to be used by multiple processes at once. I don't think the same is true of pytables classes.
You should be using the mp.Lock instance to ensure that only one process writes to the file at a time. I think you will have to modify your process function to something like the following:
def process(q, iolock):
    while True:
        d = q.get()
        if d is None:
            break
        for e, p in enumerate(d['id']):
            x = some_array1bym()
            # acquire lock to ensure only one process writes data at once
            iolock.acquire()
            # get new handle to hdf5 file in append mode
            f = tables.open_file('t.hdf5', mode='a')
            # look up the existing data set and append the new data
            array_c = f.root.data
            array_c.append(x)
            # close file and release lock to allow another process to write
            f.close()
            iolock.release()