
Can't pickle <type 'instancemethod'> when using multiprocessing Pool.map()

I'm trying to use multiprocessing's Pool.map() function to divide out work simultaneously. When I use the following code, it works fine:

import multiprocessing

def f(x):
    return x*x

def go():
    pool = multiprocessing.Pool(processes=4)        
    print pool.map(f, range(10))


if __name__== '__main__' :
    go()

However, when I use it in a more object-oriented approach, it doesn't work. The error message it gives is:

PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup
__builtin__.instancemethod failed

This occurs when the following is my main program:

import someClass

if __name__== '__main__' :
    sc = someClass.someClass()
    sc.go()

and the following is my someClass class:

import multiprocessing

class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def go(self):
        pool = multiprocessing.Pool(processes=4)       
        print pool.map(self.f, range(10))

Anyone know what the problem could be, or an easy way around it?

The problem is that multiprocessing must pickle things to sling them among processes, and bound methods are not picklable. The workaround (whether you consider it "easy" or not ;-) is to add the infrastructure to your program to allow such methods to be pickled, registering it with the copy_reg standard library method.

For example, Steven Bethard's contribution to this thread (towards the end of the thread) shows one perfectly workable approach to allow method pickling/unpickling via copy_reg.
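A minimal sketch of that kind of registration (the helper names here are my own, following the spirit of Bethard's approach; on Python 3 the module is spelled copyreg, and Python 3 can actually pickle bound methods natively, so the registration is mainly needed on Python 2):

```python
import copyreg  # named copy_reg on Python 2
import pickle
import types

def _pickle_method(method):
    # Reduce a bound method to (method name, owning instance).
    return _unpickle_method, (method.__func__.__name__, method.__self__)

def _unpickle_method(name, obj):
    # Re-bind the method by looking it up on the restored instance.
    return getattr(obj, name)

copyreg.pickle(types.MethodType, _pickle_method, _unpickle_method)

class Square(object):
    def f(self, x):
        return x * x

m = pickle.loads(pickle.dumps(Square().f))
print(m(4))  # 16
```

The instance itself still has to be picklable; the registration only teaches pickle how to handle the bound-method wrapper around it.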

All of these solutions are ugly, because multiprocessing and pickling are broken and limited unless you jump outside the standard library.

If you use a fork of multiprocessing called pathos.multiprocessing, you can directly use classes and class methods in multiprocessing's map functions. This is because dill is used instead of pickle or cPickle, and dill can serialize almost anything in Python.

pathos.multiprocessing also provides an asynchronous map function… and it can map functions with multiple arguments (e.g. map(math.pow, [1,2,3], [4,5,6])).

See: What can multiprocessing and dill do together?

and: http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/

>>> import pathos.pools as pp
>>> p = pp.ProcessPool(4)
>>> 
>>> def add(x,y):
...   return x+y
... 
>>> x = [0,1,2,3]
>>> y = [4,5,6,7]
>>> 
>>> p.map(add, x, y)
[4, 6, 8, 10]
>>> 
>>> class Test(object):
...   def plus(self, x, y): 
...     return x+y
... 
>>> t = Test()
>>> 
>>> p.map(Test.plus, [t]*4, x, y)
[4, 6, 8, 10]
>>> 
>>> p.map(t.plus, x, y)
[4, 6, 8, 10]

And just to be explicit, you can do exactly what you wanted to do in the first place, and you can do it from the interpreter, if you want to.

>>> import pathos.pools as pp
>>> class someClass(object):
...   def __init__(self):
...     pass
...   def f(self, x):
...     return x*x
...   def go(self):
...     pool = pp.ProcessPool(4)
...     print pool.map(self.f, range(10))
... 
>>> sc = someClass()
>>> sc.go()
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> 

Get the code here: https://github.com/uqfoundation/pathos


There are some limitations, though, to Steven Bethard's solution:

When you register your class method as a function, the destructor of your class is surprisingly called every time the method's processing finishes. So if you have one instance of your class that calls its method n times, members may disappear between two runs and you may get the message malloc: *** error for object 0x...: pointer being freed was not allocated (e.g. an open member file) or pure virtual method called, terminate called without an active exception (which means the lifetime of a member object I used was shorter than I thought). I got this when dealing with n greater than the pool size. Here is a short example:

from multiprocessing import Pool, cpu_count
from multiprocessing.pool import ApplyResult

# --------- see Steven's solution above -------------
from copy_reg import pickle
from types import MethodType

def _pickle_method(method):
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)


class Myclass(object):

    def __init__(self, nobj, workers=cpu_count()):

        print "Constructor ..."
        # multi-processing
        pool = Pool(processes=workers)
        async_results = [ pool.apply_async(self.process_obj, (i,)) for i in range(nobj) ]
        pool.close()
        # waiting for all results
        map(ApplyResult.wait, async_results)
        lst_results=[r.get() for r in async_results]
        print lst_results

    def __del__(self):
        print "... Destructor"

    def process_obj(self, index):
        print "object %d" % index
        return "results"

pickle(MethodType, _pickle_method, _unpickle_method)
Myclass(nobj=8, workers=3)
# problem !!! the destructor is called nobj times (instead of once)

Output:

Constructor ...
object 0
object 1
object 2
... Destructor
object 3
... Destructor
object 4
... Destructor
object 5
... Destructor
object 6
... Destructor
object 7
... Destructor
... Destructor
... Destructor
['results', 'results', 'results', 'results', 'results', 'results', 'results', 'results']
... Destructor

The __call__ method is not equivalent either, because [None, ...] is read back from the results:

from multiprocessing import Pool, cpu_count
from multiprocessing.pool import ApplyResult

class Myclass(object):

    def __init__(self, nobj, workers=cpu_count()):

        print "Constructor ..."
        # multiprocessing
        pool = Pool(processes=workers)
        async_results = [ pool.apply_async(self, (i,)) for i in range(nobj) ]
        pool.close()
        # waiting for all results
        map(ApplyResult.wait, async_results)
        lst_results=[r.get() for r in async_results]
        print lst_results

    def __call__(self, i):
        self.process_obj(i)

    def __del__(self):
        print "... Destructor"

    def process_obj(self, i):
        print "obj %d" % i
        return "result"

Myclass(nobj=8, workers=3)
# problem !!! the destructor is called nobj times (instead of once), 
# **and** results are empty !

So neither method is satisfying...

There's another shortcut you can use, although it can be inefficient depending on what's in your class instances.

As everyone has said, the problem is that the multiprocessing code has to pickle the things that it sends to the sub-processes it has started, and the pickler doesn't do instance methods.

However, instead of sending the instance method, you can send the actual class instance, plus the name of the function to call, to an ordinary function that then uses getattr to call the instance method, thus creating the bound method in the Pool subprocess. This is similar to defining a __call__ method, except that you can call more than one member function.

Stealing @EricH.'s code from his answer and annotating it a bit (I retyped it, hence all the name changes and such; for some reason this seemed easier than cut-and-paste :-) ) to illustrate all the magic:

import multiprocessing
import os

def call_it(instance, name, args=(), kwargs=None):
    "indirect caller for instance methods and multiprocessing"
    if kwargs is None:
        kwargs = {}
    return getattr(instance, name)(*args, **kwargs)

class Klass(object):
    def __init__(self, nobj, workers=multiprocessing.cpu_count()):
        print "Constructor (in pid=%d)..." % os.getpid()
        self.count = 1
        pool = multiprocessing.Pool(processes = workers)
        async_results = [pool.apply_async(call_it,
            args = (self, 'process_obj', (i,))) for i in range(nobj)]
        pool.close()
        map(multiprocessing.pool.ApplyResult.wait, async_results)
        lst_results = [r.get() for r in async_results]
        print lst_results

    def __del__(self):
        self.count -= 1
        print "... Destructor (in pid=%d) count=%d" % (os.getpid(), self.count)

    def process_obj(self, index):
        print "object %d" % index
        return "results"

Klass(nobj=8, workers=3)

The output shows that, indeed, the constructor is called once (in the original pid) and the destructor is called 9 times (once for each copy made, i.e. 2 or 3 times per pool worker process as needed, plus once in the original process). This is often OK, as in this case, since the default pickler makes a copy of the entire instance and (semi-)secretly re-populates it, in this case doing:

obj = object.__new__(Klass)
obj.__dict__.update({'count':1})

That's why, even though the destructor is called eight times in the three worker processes, it counts down from 1 to 0 each time; but of course you can still get into trouble this way. If necessary, you can provide your own __setstate__:

    def __setstate__(self, adict):
        self.count = adict['count']

in this case, for instance.
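A self-contained sketch of such a __setstate__ in action (the class name here is hypothetical; __setstate__ is invoked on unpickling in place of the normal re-population of __dict__):

```python
import pickle

class Counted(object):
    def __init__(self):
        self.count = 1

    def __setstate__(self, adict):
        # Called when unpickling, instead of __init__ or the
        # default __dict__.update(), so you control what is restored.
        self.count = adict['count']

c = pickle.loads(pickle.dumps(Counted()))
print(c.count)  # 1
```

Each worker-side copy restored this way is a fresh object whose destructor will still run in that worker, so __setstate__ controls what state the copy gets, not how many copies exist.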

You could also define a __call__() method inside your someClass(), which calls someClass.go(), and then pass an instance of someClass() to the pool. This object is pickleable and it works fine (for me)...

from multiprocessing import Pool

class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def go(self):
        p = Pool(4)
        sc = p.map(self, range(4))
        print sc

    def __call__(self, x):
        return self.f(x)

sc = someClass()
sc.go()

The solution from parisjohn above works fine for me. Plus, the code looks clean and easy to understand. In my case there are a few functions to call using Pool, so I modified parisjohn's code a bit below: I made __call__ able to call several functions, and the function names are passed in the argument dict from go():

from multiprocessing import Pool
class someClass(object):
    def __init__(self):
        pass
    
    def f(self, x):
        return x*x
    
    def g(self, x):
        return x*x+1    

    def go(self):
        p = Pool(4)
        sc = p.map(self, [{"func": "f", "v": 1}, {"func": "g", "v": 2}])
        print sc

    def __call__(self, x):
        if x["func"]=="f":
            return self.f(x["v"])
        if x["func"]=="g":
            return self.g(x["v"])        

sc = someClass()
sc.go()

In this simple case, where someClass.f is not inheriting any data from the class and not attaching anything to the class, a possible solution would be to separate out f, so it can be pickled:

import multiprocessing


def f(x):
    return x*x


class someClass(object):
    def __init__(self):
        pass

    def go(self):
        pool = multiprocessing.Pool(processes=4)       
        print pool.map(f, range(10))

A potentially trivial solution to this is to switch to using multiprocessing.dummy. This is a thread-based implementation of the multiprocessing interface that doesn't seem to have this problem in Python 2.7. I don't have a lot of experience here, but this quick import change allowed me to call apply_async on a class method.

A few good resources on multiprocessing.dummy:

https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.dummy

http://chriskiehl.com/article/parallelism-in-one-line/
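To illustrate, a minimal sketch of that import change (the class and method names here are placeholders; since the workers are threads sharing one address space, nothing is pickled at all):

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but thread-based

class SomeClass(object):
    def f(self, x):
        return x * x

    def go(self):
        pool = Pool(4)
        try:
            # Bound methods never leave the process, so no pickling error.
            return pool.map(self.f, range(10))
        finally:
            pool.close()
            pool.join()

print(SomeClass().go())  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The trade-off is that, under CPython's GIL, this only speeds up I/O-bound work, not CPU-bound work like the x*x example above.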

Why not use a separate function?

def func(*args, **kwargs):
    # 'inst' must be reachable in the worker, e.g. a module-level global
    return inst.method(*args, **kwargs)

print pool.map(func, arr)

I ran into this same issue, but found out that there is a JSON encoder that can be used to move these objects between processes.

from pyVmomi.VmomiSupport import VmomiJSONEncoder

Use this to create your list:

jsonSerialized = json.dumps(pfVmomiObj, cls=VmomiJSONEncoder)

Then, in the mapped function, use this to recover the object:

pfVmomiObj = json.loads(jsonSerialized)

Update: as of this writing, namedtuples are picklable (starting with Python 2.7).

The issue here is that the child processes aren't able to import the class of the object (in this case, the class P). In the case of a multi-module project, the class P should be importable anywhere the child process gets used.

A quick workaround is to make it importable by assigning it to globals():

globals()["P"] = P
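A sketch of how that plays out with plain pickle (the factory function and class P here are made up for illustration; note that on Python 3, pickle looks classes up by qualified name, so __qualname__ also has to be patched for a class created at runtime):

```python
import pickle

def make_class():
    # A class created at runtime: pickle can't find it by its
    # module-level name, so pickling instances of it would fail.
    class P(object):
        def f(self, x):
            return x * x
    return P

P = make_class()
P.__qualname__ = P.__name__  # Python 3: pickle resolves classes via __qualname__
globals()["P"] = P           # make P importable by name from this module

inst = pickle.loads(pickle.dumps(P()))
print(inst.f(3))  # 9
```

The same idea applies to child processes: as long as the worker can resolve the class by module and name, instances of it round-trip through pickle.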

pathos.multiprocessing worked for me.

It has a pool method and, unlike multiprocessing, it serializes everything:

import pathos.multiprocessing as mp
pool = mp.Pool(processes=2) 

There is no need to install the full pathos package, even.

Actually, the only package needed is dill (pip install dill); then override the multiprocessing Pickler with the dill one:

import dill
import multiprocessing.reduction

dill.Pickler.dumps, dill.Pickler.loads = dill.dumps, dill.loads
multiprocessing.reduction.ForkingPickler = dill.Pickler
multiprocessing.reduction.dump = dill.dump

This answer was borrowed from https://stackoverflow.com/a/69253561/10686785
