I'm trying to read a bunch of HDF5 files ("a bunch" meaning N > 1000 files) using PyTables and multiprocessing. Basically, I created a class to read and store my data in RAM; it works perfectly fine in sequential mode, and I'd like to parallelize it to gain some performance.
I tried a dummy approach for now, adding a new method flatten() to my class to parallelize file reading. The following is a simplified example of what I'm trying to do. listf is a list of strings containing the names of the files to read, and nx and ny are the dimensions of the array I want to read from each file:
    import numpy as np
    import multiprocessing as mp
    import tables

    class data:
        def __init__(self, listf, nx, ny, nproc=0):
            self.listinc = []
            for i in range(len(listf)):
                self.listinc.append((listf[i], nx, ny))

        def __del__(self):
            del self.listinc

        def get_dsets(self, tuple_inc):
            listf, nx, ny = tuple_inc
            x = np.zeros((nx, ny))
            f = tables.openFile(listf)
            x = np.transpose(f.root.x[:ny,:nx])
            f.close()
            return(x)

        def flatten(self):
            nproc = mp.cpu_count()*2

            def worker(tasks, results):
                for i, x in iter(tasks.get, 'STOP'):
                    print i, x
                    results.put(i, self.get_dsets(x))

            tasks = mp.Queue()
            results = mp.Queue()
            manager = mp.Manager()
            lx = manager.list()
            for i, out in enumerate(self.listinc):
                tasks.put((i, out))
            for i in range(nproc):
                mp.Process(target=worker, args=(tasks, results)).start()
            for i in range(len(self.listinc)):
                j, res = results.get()
                lx.append(res)
            for i in range(nproc):
                tasks.put('STOP')
I tried different things (including, like I did in this simple example, the use of a manager to retrieve the data) but I always get a TypeError: an integer is required.
I do not use ctypes arrays because I don't really need shared arrays (I just want to retrieve my data), and after retrieving the data I want to work with it in NumPy.
Any thoughts, hints, or help would be highly appreciated!
Edit: The complete error I get is the following:
    Process Process-341:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
        self._target(*self._args, **self._kwargs)
      File "/home/toto/test/rd_para.py", line 81, in worker
        results.put(i, self.get_dsets(x))
      File "/usr/lib/python2.7/multiprocessing/queues.py", line 101, in put
        if not self._sem.acquire(block, timeout):
    TypeError: an integer is required
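The answer was actually very simple...
In the worker, since it is a tuple that I retrieve from the results queue, I can't do:
    results.put(i, self.get_dsets(x))
because Queue.put() accepts a single object; its second positional argument is block, which is where my array ended up (hence the TypeError raised in self._sem.acquire(block, timeout)). Instead I have to do:
    results.put((i, self.get_dsets(x)))
which then works perfectly well.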
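
For reference, here is a minimal, self-contained sketch of the same task/result queue pattern with the tuple fix applied. It is not the original class: load() is a hypothetical stand-in for get_dsets() that builds a nested list instead of reading an HDF5 file, and read_all() and its nproc parameter are names made up for this illustration. Unlike the code in the question, it queues the 'STOP' sentinels before starting the workers and stores each result at its original index, since results can arrive in any order.

```python
import multiprocessing as mp


def load(task):
    # Hypothetical stand-in for get_dsets(): no file is opened here.
    # Returns an nx-by-ny nested list filled with the length of the file name.
    name, nx, ny = task
    return [[len(name)] * ny for _ in range(nx)]


def worker(tasks, results):
    # Consume (index, task) pairs until the 'STOP' sentinel is seen.
    for i, task in iter(tasks.get, 'STOP'):
        # Queue.put() takes ONE object (then block, timeout),
        # so the index and the data are packed into a single tuple.
        results.put((i, load(task)))


def read_all(task_list, nproc=2):
    tasks = mp.Queue()
    results = mp.Queue()
    for item in enumerate(task_list):
        tasks.put(item)
    # One sentinel per worker so that every process terminates.
    for _ in range(nproc):
        tasks.put('STOP')
    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(nproc)]
    for p in procs:
        p.start()
    # Results arrive in arbitrary order; the index restores the input order.
    out = [None] * len(task_list)
    for _ in range(len(task_list)):
        i, data = results.get()
        out[i] = data
    # Join only after the results queue has been drained.
    for p in procs:
        p.join()
    return out
```

A call such as read_all([('a.h5', 2, 3), ('b.h5', 2, 3)]) would go under an if __name__ == '__main__': guard, so that platforms which start child processes by re-importing the script do not launch workers recursively.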