multiprocessing a for loop with a function that takes more than one argument
I know this question has been asked multiple times, but I could not find a case similar to mine.
I have this function:
def load_data(list_of_files, INP_DIR, return_featues=False):
    data = []
    # ------- I want to multithread this block ------#
    for file_name in tqdm(list_of_files):
        subject, features = load_subject(INP_DIR, file_name)
        data.append(subject.reset_index())
    # -------------#
    data = pd.concat(data, axis=0, ignore_index=True)
    target = data['label']
    if return_featues:
        return data, target, features
    else:
        return data, target
The above function uses load_subject, which, for reference, is defined as follows:
def load_subject(INP_DIR, file_name):
    subject = pd.read_csv(INP_DIR + file_name, sep='|')
    < do some processing ...>
    return subject, features
I have 64 CPU cores but I am not able to use all of them. I tried this with multiprocessing:
train_files = ['p011431.psv', 'p008160.psv', 'p007253.psv', 'p018373.psv', 'p017040.psv']

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(processes=64) as pool:
        pool.map(load_data, train_files)
As you see, train_files is a list of file names. When I run the above lines, I get this error:
---------------------------------------------------------------------------
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
TypeError: load_subject() missing 1 required positional argument: 'file_name'
"""
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-24-96a3ce89ebb8> in <module>()
2 if __name__ == '__main__':
3 with Pool(processes=2) as pool:
----> 4 pool.map(load_subject, train_files) # process data_inputs iterable with pool
/anaconda3/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
264 in a list that is returned.
265 '''
--> 266 return self._map_async(func, iterable, mapstar, chunksize).get()
267
268 def starmap(self, func, iterable, chunksize=None):
/anaconda3/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
642 return self._value
643 else:
--> 644 raise self._value
645
646 def _set(self, i, obj):
TypeError: load_subject() missing 1 required positional argument: 'file_name'
After Tom's answer, I found another way to pass only one argument. Here are the updated functions, followed by the error I am now getting:
def load_data(list_of_files):
    data = []
    # ------- I want to multithread this block ------#
    for file_name in tqdm(list_of_files):
        subject, features = load_subject(file_name)
        data.append(subject.reset_index())
    # -------------#
    data = pd.concat(data, axis=0, ignore_index=True)
    target = data['label']
    return data, target

def load_subject(file_name):
    subject = pd.read_csv(file_name, sep='|')
    < do some processing ...>
    return subject, features
train_files = ['p011431.psv', 'p008160.psv', 'p007253.psv', 'p018373.psv']

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(processes=64) as pool:
        pool.map(load_data, train_files)
When I run the above lines, I get a new error:
---------------------------------------------------------------------------
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "<ipython-input-21-494105028a08>", line 407, in load_data
subject , features = load_subject(file_name)
File "<ipython-input-21-494105028a08>", line 170, in load_subject
subject= pd.read_csv(file_name, sep='|')
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
self._make_engine(self.engine)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
"""
The above exception was the direct cause of the following exception:
ParserError Traceback (most recent call last)
<ipython-input-22-d6dcc5840b63> in <module>()
4
5 with Pool(processes=3) as pool:
----> 6 pool.map(load_data, files)
/anaconda3/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
264 in a list that is returned.
265 '''
--> 266 return self._map_async(func, iterable, mapstar, chunksize).get()
267
268 def starmap(self, func, iterable, chunksize=None):
/anaconda3/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
642 return self._value
643 else:
--> 644 raise self._value
645
646 def _set(self, i, obj):
ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
multiprocessing's Pool.map() function can only pass one argument at a time. I believe there's a "proper" workaround for this in Python 3, but I used the following hack in Python 2 all the time and see no reason why it wouldn't still work.
Define a wrapper for load_subject that takes only one argument, and define a small parameter object to use as that argument:
def wrapped_load_subject(param):
    return load_subject(param.inp_dir, param.file_name)

class LoadSubjectParam:
    def __init__(self, inp_dir, file_name):
        self.inp_dir = inp_dir
        self.file_name = file_name

train_files = []  # Make this a list of LoadSubjectParam objects

with Pool(processes=64) as pool:
    pool.map(wrapped_load_subject, train_files)
Your load_data accepts a list_of_files, so you cannot pass train_files (a flat list of file names) to pool.map: each worker would receive a single file name string instead of a list. What you pass should be a list of lists of files.
Get the result like this:

with Pool(processes=64) as pool:
    res = pool.map(load_data, train_files)
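Concretely, one way to build that list of lists is to split train_files into one chunk per worker. The chunk helper below is my own illustration, not part of the original answer:

```python
def chunk(files, n_chunks):
    # Round-robin split so each worker gets a roughly equal share.
    return [files[i::n_chunks] for i in range(n_chunks)]

train_files = ['p011431.psv', 'p008160.psv', 'p007253.psv', 'p018373.psv']
chunks = chunk(train_files, 2)
print(chunks)
# → [['p011431.psv', 'p007253.psv'], ['p008160.psv', 'p018373.psv']]

# Each chunk is itself a list of files, which matches load_data's signature:
# with Pool(processes=2) as pool:
#     res = pool.map(load_data, chunks)  # res: list of (data, target) pairs
```

With this shape, each call to load_data still concatenates its own chunk, and the parent process can concatenate the per-chunk results afterwards.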