
Safely write to file in parallel with pathos.multiprocessing

pathos.multiprocessing is known to have an advantage over Python's standard multiprocessing library, in that it uses dill instead of pickle and can therefore serialize a wider range of functions and other objects.

But when it comes to writing pool.map() results to a file line by line with pathos, some trouble comes up. If all processes in a ProcessPool write their results line by line into a single file, they interfere with each other, writing some lines simultaneously and corrupting the output. Using the ordinary multiprocessing package, I was able to make each process write to its own separate file, named after the current process id, like this:

import gzip
import multiprocessing as mpp

example_data = range(100)

def process_point(point):
    # each process appends to its own file, keyed by its pid
    output = "output-%d.gz" % mpp.current_process().pid
    with gzip.open(output, "a+") as fout:
        fout.write('%d\n' % point**2)

Then, this code works well:

import multiprocessing as mpp
pool = mpp.Pool(8)
pool.map(process_point, example_data)

But this code doesn't:

from pathos import multiprocessing as mpp
pool = mpp.Pool(8)
pool.map(process_point, example_data)

and throws an AttributeError:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-a6fb174ec9a5> in <module>()
----> 1 pool.map(process_point, example_data)

/usr/local/lib/python2.7/dist-packages/processing-0.52_pathos-py2.7-linux-x86_64.egg/processing/pool.pyc in map(self, func, iterable, chunksize)
    128         '''
    129         assert self._state == RUN
--> 130         return self.mapAsync(func, iterable, chunksize).get()
    131
    132     def imap(self, func, iterable, chunksize=1):

/usr/local/lib/python2.7/dist-packages/processing-0.52_pathos-py2.7-linux-x86_64.egg/processing/pool.pyc in get(self, timeout)
    371             return self._value
    372         else:
--> 373             raise self._value
    374
    375     def _set(self, i, obj):

AttributeError: 'module' object has no attribute 'current_process'

There is no current_process() in pathos, and I cannot find anything similar to it. Any ideas?

This simple trick seems to do the job:

import multiprocessing as mp
from pathos import multiprocessing as pathos_mp
import gzip

example_data = range(100)
def process_point(point):
    output = "output-%d.gz" % mp.current_process().pid
    with gzip.open(output, "a+") as fout:
        fout.write('%d\n' % point**2)

pool = pathos_mp.Pool(8)
pool.map(process_point, example_data)

To put it differently: one can use pathos for the parallel computation and the ordinary multiprocessing package to get the id of the current process, and this works correctly!

I'm the pathos author. While your answer works for this case, it's probably better to use the fork of multiprocessing that ships within pathos, found at the rather obtuse location pathos.helpers.mp.

This gives you a one-to-one mapping with multiprocessing, but with better serialization. Thus, you'd use pathos.helpers.mp.current_process.

Sorry, it's both undocumented and not obvious… I should improve at least one of those two issues.
