
Is it safe to pass a function or a class instance as the “target” argument of multiprocessing.Process without risking an expensive full copy?

I wonder whether it is possible to pass a method of another class, one that has lots of imports and is pretty heavyweight, to an instance of multiprocessing.Process as an argument. Note that I am going to run this code on a Unix-based machine, so Process will fork rather than spawn. Here's an example:

#class1.py
from class3 import Class3
class Class1(object):
    def __init__(self):
        self.class3Instance = Class3()

    def func1(self):
        self.class3Instance.func3()

#class3.py
import numpy as np
import pandas
import cv2 # OpenCV library
# there are many other things that I am importing here

class Class3(object):
    def __init__(self):
        pass

    def func3(self):
        np.random.seed(1)
        print('func3 changed the random seed')

#class2.py
import numpy as np
class Class2(object):
    def __init__(self):
        pass

    def func2(self, funcInput):
        funcInput()

#main.py
from multiprocessing import Process

from class1 import Class1
from class2 import Class2

class1Instance = Class1()
class2Instance = Class2()

class2Process = Process(target=class2Instance.func2, kwargs={'funcInput': class1Instance.func1})
class2Process.start()
class2Process.join()

This example seems to work fine at such a small scale, but I'm afraid that at a larger scale multiprocessing.Process will not be able to simply fork, and will instead try to make a full copy of every class in the hierarchy. I do not want that to happen. Is that a valid concern?

multiprocessing.Process, used in fork mode, does not need to pickle the bound method (which would require pickling the instance), so there is minimal up-front cost. There is no documented guarantee of this as far as I can tell, but CPython's fork-based implementation doesn't do it, and there is no reason it ever should, so I can't see that behaviour being taken away when nothing would be gained by removing it.
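
To make that concrete, here is a minimal sketch of my own (the Unpicklable class and the explicit fork context are not from the question): the instance refuses to be pickled, yet passing its bound method as the target works under the fork start method, which shows the target never goes through pickle.

# fork_no_pickle_sketch.py
import multiprocessing

class Unpicklable(object):
    def __getstate__(self):
        # If anything tried to pickle this instance, it would fail loudly here.
        raise TypeError('this object refuses to be pickled')

    def work(self):
        print('child ran the bound method without pickling the instance')

if __name__ == '__main__':
    ctx = multiprocessing.get_context('fork')  # Unix-only start method
    p = ctx.Process(target=Unpicklable().work)
    p.start()
    p.join()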

That said, CPython's reference-counted design (with a cyclic garbage collector to handle the cases reference counting misses) means the object header of every Python object gets touched intermittently, and touching a header dirties the whole page containing it. So while no CPU time is spent on an actual serialize/deserialize cycle, a long-running Process will usually end up sharing few pages with its parent.
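
If that gradual copying matters for your workload, one mitigation worth mentioning (my own aside, not something the question asked about) is gc.freeze(), available since CPython 3.7: it moves every object currently tracked by the cyclic collector into a permanent generation that later collections ignore, so the collector itself stops dirtying those pages after the fork. It does not stop reference-count updates from touching object headers, so pages still get copied as the child actually uses the objects. A rough sketch, with file and function names of my own choosing:

# gc_freeze_sketch.py
import gc
import multiprocessing

def child_work():
    # Reads pre-fork data; pages stay shared until something writes to them.
    print('child sees', len(big_state), 'items and',
          gc.get_freeze_count(), 'frozen objects')

if __name__ == '__main__':
    big_state = [str(i) for i in range(100000)]  # pre-fork data the child will read
    gc.disable()   # avoid a collection between freeze() and fork
    gc.freeze()    # push existing objects into the permanent generation
    ctx = multiprocessing.get_context('fork')
    p = ctx.Process(target=child_work)
    p.start()
    p.join()
    gc.enable()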

Also note that multiprocessing.Process in fork mode is the only case where you benefit from this. The forkserver and spawn start methods don't get copy-on-write copies of the parent's pages, so they can't benefit. And multiprocessing.Pool methods like apply/apply_async and the various map-like functions always pickle both the function to be called and its arguments, because the worker processes don't know what tasks they'll be asked to run when they're forked, and the objects could have changed post-fork, so the tasks are re-pickled every time.
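
As a quick illustration of that difference (this snippet is mine, not part of the original answer): a lambda is not picklable by the standard pickle module, so it is perfectly fine as a Process target under fork, but handing the same lambda to Pool.apply fails because the pool has to pickle it onto the task queue.

# process_vs_pool_sketch.py
import multiprocessing

if __name__ == '__main__':
    ctx = multiprocessing.get_context('fork')
    task = lambda: print('ran without pickling')

    p = ctx.Process(target=task)  # never pickled under fork: this works
    p.start()
    p.join()

    with ctx.Pool(1) as pool:
        try:
            pool.apply(task)      # must be pickled for the worker: this raises
        except Exception as exc:
            print('Pool rejected the lambda:', exc)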
