
Read multiple HDF5 files in Python using multiprocessing

I'm trying to read a bunch of HDF5 files ("a bunch" meaning N > 1000 files) using PyTables and multiprocessing. Basically, I create a class to read and store my data in RAM; it works perfectly fine in sequential mode and I'd like to parallelize it to gain some performance.

I tried a dummy approach for now, adding a new method flatten() to my class to parallelize the file reading. The following is a simplified example of what I'm trying to do. listf is a list of strings containing the names of the files to read, and nx and ny are the size of the array I want to read from each file:

import numpy as np
import multiprocessing as mp
import tables

class data:
    def __init__(self, listf, nx, ny, nproc=0):
        self.listinc = []
        for i in range(len(listf)):
             self.listinc.append((listf[i], nx, ny))

    def __del__(self):
        del self.listinc

    def get_dsets(self, tuple_inc):
        listf, nx, ny = tuple_inc
        x = np.zeros((nx, ny))
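        # openFile is the PyTables 2.x spelling (renamed to open_file in PyTables 3.x)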
        f = tables.openFile(listf)
        x = np.transpose(f.root.x[:ny,:nx])
        f.close()
        return(x)

    def flatten(self):
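        # spawn twice as many worker processes as CPU cores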
        nproc = mp.cpu_count()*2

        def worker(tasks, results):
            for i, x in iter(tasks.get, 'STOP'):
                print i, x
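                # buggy call: put() receives two positional arguments here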
                results.put(i, self.get_dsets(x))

        tasks   = mp.Queue()
        results = mp.Queue()
        manager = mp.Manager()
        lx      = manager.list()

        for i, out in enumerate(self.listinc):
            tasks.put((i, out))

        for i in range(nproc):
            mp.Process(target=worker, args=(tasks, results)).start()

        for i in range(len(self.listinc)):
            j, res = results.get()
            lx.append(res)

        for i in range(nproc):
            tasks.put('STOP')

I tried different things (including, as in this simplified example, using a manager to retrieve the data) but I always get TypeError: an integer is required.

I do not use ctypes arrays because I don't really need shared arrays (I just want to retrieve my data), and after retrieving the data I want to work with it in NumPy.

Any thought, hint or help would be highly appreciated!

Edit: The complete error I get is the following:

Process Process-341:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/toto/test/rd_para.py", line 81, in worker
    results.put(i, self.get_dsets(x))
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 101, in put
    if not self._sem.acquire(block, timeout):
TypeError: an integer is required

The answer was actually very simple...

In the worker, the index and the array have to go onto the queue as a single object: Queue.put() has the signature put(obj, block=True, timeout=None), so with two positional arguments the array gets interpreted as the block flag, and self._sem.acquire(block, timeout) then raises TypeError: an integer is required, exactly as in the traceback above. So I can't do:

results.put(i, self.get_dsets(x))

but instead I have to do:

results.put((i, self.get_dsets(x)))

which then works perfectly well.
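For reference, the full worker after the fix looks like this (only the put() call changes from the version above):

def worker(tasks, results):
    for i, x in iter(tasks.get, 'STOP'):
        print i, x
        # put a single (index, array) tuple as the one object
        results.put((i, self.get_dsets(x)))

As an aside, since the goal is only to collect one array per file back in the parent process, multiprocessing.Pool can replace the hand-rolled queues and workers entirely. A minimal sketch under the same assumptions (the file names and array sizes below are hypothetical placeholders for listinc):

import multiprocessing as mp
import numpy as np
import tables

def read_one(args):
    # top-level equivalent of data.get_dsets so Pool can pickle it
    fname, nx, ny = args
    f = tables.openFile(fname)
    x = np.transpose(f.root.x[:ny, :nx])
    f.close()
    return x

if __name__ == '__main__':
    # hypothetical file list; in practice this is self.listinc
    listinc = [('data%04d.h5' % i, 100, 200) for i in range(1000)]
    pool = mp.Pool(processes=mp.cpu_count())
    arrays = pool.map(read_one, listinc)  # ordered list of NumPy arrays
    pool.close()
    pool.join()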
