
Reading and appending data from files to a list in parallel using python

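I am trying to read several files and append certain elements from them to a list. Reading the files seems slow, so I thought multiprocessing might help. I put together the code below to do what I want: it opens the numbered files file_%i in parallel, extracts the relevant data in read_append, and appends it to a global array shared between the processes, res = manager.list(). The sample code is given below. However, this does not work: trying to print a.shape gives the error message included below the sample code. I am not sure how to fix this code, and I am new to multiprocessing. I suspect this hacky script, which I pieced together from SO answers and the multiprocessing manual pages, is far from ideal.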

import multiprocessing as mp
import numpy as np
from timeit import default_timer as timer
start = timer()
def read_append(input_list):
    val, res_arr = input_list
    data_file = np.load('file_%i.npz' %val, mmap_mode = 'r', allow_pickle=True)['data']
    for i in range(len(data_file)):
        res_arr.append(data_file[i][1])
    return None


if __name__ == '__main__':
    N= mp.cpu_count()
    print(N)
    with mp.Manager() as manager:
        res = manager.list()
        input_list = [(val, res) for val in range(2)]
        with mp.Pool(processes = N) as p:
            results = p.map(read_append,input_list)
end = timer()
print(end-start)
a = list(res)
print(a.shape)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/anaconda3/lib/python3.7/multiprocessing/managers.py in _callmethod(self, methodname, args, kwds)
    810         try:
--> 811             conn = self._tls.connection
    812         except AttributeError:

AttributeError: 'ForkAwareLocal' object has no attribute 'connection'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-13-35028af51086> in <module>
     21 end = timer()
     22 print(end-start)
---> 23 a = list(res)
     24 print(a.shape)

<string> in __len__(self, *args, **kwds)

~/anaconda3/lib/python3.7/multiprocessing/managers.py in _callmethod(self, methodname, args, kwds)
    813             util.debug('thread %r does not own a connection',
    814                        threading.current_thread().name)
--> 815             self._connect()
    816             conn = self._tls.connection
    817 

~/anaconda3/lib/python3.7/multiprocessing/managers.py in _connect(self)
    800         if threading.current_thread().name != 'MainThread':
    801             name += '|' + threading.current_thread().name
--> 802         conn = self._Client(self._token.address, authkey=self._authkey)
    803         dispatch(conn, None, 'accept_connection', (name,))
    804         self._tls.connection = conn

~/anaconda3/lib/python3.7/multiprocessing/connection.py in Client(address, family, authkey)
    490         c = PipeClient(address)
    491     else:
--> 492         c = SocketClient(address)
    493 
    494     if authkey is not None and not isinstance(authkey, bytes):

~/anaconda3/lib/python3.7/multiprocessing/connection.py in SocketClient(address)
    617     with socket.socket( getattr(socket, family) ) as s:
    618         s.setblocking(True)
--> 619         s.connect(address)
    620         return Connection(s.detach())
    621 

FileNotFoundError: [Errno 2] No such file or directory
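  1. I don't think res is a global variable; why do you think it is?
  2. Lists do not have a shape attribute; numpy arrays do.
  3. You are trying to access the managed list res after the manager process that holds it has been shut down. You therefore need to move the code that uses res inside the with mp.Manager() as manager block.
  4. Your timer does not measure anything useful outside the main process. In a child process, it only measures the time needed to import the libraries and define the function. Consider moving it inside the if __name__ == '__main__': block. If you want to time each function call instead, start the timer inside the function and return end - start.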
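
The second suggestion in point 4 can be sketched roughly as follows. This is a minimal illustration, not the asker's code: timed_task and its placeholder workload are hypothetical stand-ins for read_append. Each call starts its own timer and returns the elapsed time, while the overall wall-clock timer lives inside the if __name__ == '__main__': block.

```python
import multiprocessing as mp
from timeit import default_timer as timer


def timed_task(val):
    """Hypothetical stand-in for read_append: do some work and time it."""
    start = timer()  # started inside the worker, so it times only this call
    total = sum(i * val for i in range(100_000))  # placeholder workload
    elapsed = timer() - start
    return total, elapsed


if __name__ == '__main__':
    wall_start = timer()  # overall timer lives in the main block
    with mp.Pool(processes=2) as pool:
        results = pool.map(timed_task, range(4))
    for val, (total, elapsed) in enumerate(results):
        print('task %i took %.6f s' % (val, elapsed))
    print('wall-clock time: %.6f s' % (timer() - wall_start))
```

Returning the elapsed time alongside the result keeps the measurement tied to the task that produced it, so per-task times and the total wall-clock time can be compared directly.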

Example fixed code:

import multiprocessing as mp
import numpy as np
from timeit import default_timer as timer


def read_append(input_list):
    # Each task receives a (file index, shared list proxy) pair.
    val, res_arr = input_list
    data_file = np.load('file_%i.npz' % val, mmap_mode='r', allow_pickle=True)['data']
    # Append the second element of every row to the managed list.
    for i in range(len(data_file)):
        res_arr.append(data_file[i][1])
    return None


if __name__ == '__main__':
    start = timer()  # timer now lives in the main block
    N = mp.cpu_count()
    print(N)
    with mp.Manager() as manager:
        res = manager.list()
        input_list = [(val, res) for val in range(2)]
        with mp.Pool(processes=N) as p:
            results = p.map(read_append, input_list)
        # Copy the results out while the manager process is still alive.
        a = np.array(res)
        print(a.shape)
    end = timer()
    print(end - start)
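
As an alternative to sharing a managed list, each worker can return the values it extracted and the parent process can concatenate them, which avoids one round trip to the manager process per append. The sketch below is an illustration under stated assumptions rather than part of the original answer: read_column, load_all, and make_sample_files are hypothetical helpers, the sample files are generated in a temporary directory, and data is assumed to be a 2-D numeric array, so data[:, 1] plays the role of data_file[i][1].

```python
import multiprocessing as mp
import os
import tempfile

import numpy as np


def make_sample_files(directory, n_files=2, n_rows=3):
    """Write small hypothetical file_<i>.npz files, each with a 'data' array."""
    paths = []
    for val in range(n_files):
        data = np.arange(n_rows * 2, dtype=float).reshape(n_rows, 2) + 10 * val
        path = os.path.join(directory, 'file_%i.npz' % val)
        np.savez(path, data=data)
        paths.append(path)
    return paths


def read_column(path):
    """Load one file and return the second element of every row."""
    with np.load(path) as npz:
        return npz['data'][:, 1].copy()


def load_all(paths, processes=2):
    """Read the files in parallel; map() returns results in input order."""
    with mp.Pool(processes=processes) as pool:
        columns = pool.map(read_column, paths)
    return np.concatenate(columns)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        a = load_all(make_sample_files(tmp))
    print(a.shape)  # (6,) for two sample files of three rows each
```

Because the result is already a numpy array, a.shape works without any conversion, and no manager process has to stay alive while the data is used.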


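Disclaimer: The technical posts on this site are licensed under CC BY-SA 4.0. If you need to repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.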

 