
Workaround for multiprocessing with local functions in Python?

Multiprocessing with locally defined functions?

I am porting over a library for a client who is very picky about external dependencies.

The majority of the multiprocessing in this library is handled by the pathos ProcessPool module, mainly because it deals very easily with locally defined functions.

I'm trying to get some of this functionality back without forcing this dependency (or having to rewrite large chunks of the library). I understand that the following code works because the function is defined at the top level:

import multiprocessing as mp


def f(x):
    return x * x


def main():
    with mp.Pool(5) as p:
        print(p.map(f, [i for i in range(10)]))


if __name__ == "__main__":
    main()

The following code (which is what I need to get working) fails because the function is only defined in the local scope:

import multiprocessing as mp


def main():
    def f(x):
        return x * x

    with mp.Pool(5) as p:
        print(p.map(f, [i for i in range(10)]))


if __name__ == "__main__":
    main()

Does anyone know of a good workaround for this specific use case that doesn't require external dependencies? Thanks for reading.

Updates:

  • There is a workaround that uses fork, but this is unsafe for Mac and Windows (thanks @Monica and @user2357112).
  • @Blop provided an excellent suggestion that will work for many. In my case (not the toy example above), the objects in my generator are unmarshallable.
  • @amsh provided a workaround that seems to work for any function + generator. While a great option, the downside is that it requires the function to be defined at global scope.
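For reference, a minimal sketch of the fork-based workaround mentioned above (Unix-only; `run_local` is an illustrative name, not part of the original code). Under the "fork" start method, children inherit the parent's memory, so a local function needs no pickling — but note this only helps with plain Process targets, since Pool.map still pickles the function onto its task queue:

```python
import multiprocessing as mp


def run_local(values):
    # Locally defined function -- unpicklable, so Pool.map would fail.
    def f(x):
        return x * x

    # The "fork" start method copies the parent's memory into each
    # child, so f is simply inherited; no pickling is involved for
    # plain Process targets. Unavailable on Windows, unsafe on macOS.
    ctx = mp.get_context("fork")
    q = ctx.Queue()
    procs = [ctx.Process(target=lambda v: q.put((v, f(v))), args=(v,))
             for v in values]
    for p in procs:
        p.start()
    # Drain the queue before joining to avoid blocking on full pipes.
    results = dict(q.get() for _ in procs)
    for p in procs:
        p.join()
    return [results[v] for v in values]


if __name__ == "__main__":
    print(run_local(range(10)))
```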

The main problem is the closure variables.

If you don't have those, it can be done like this:

import marshal
import multiprocessing
import types
from functools import partial


def main():
    def internal_func(c):
        return c*c

    with multiprocessing.Pool(5) as pool:
        print(internal_func_map(pool, internal_func, [i for i in range(10)]))


def internal_func_map(pool, f, gen):
    marshaled = marshal.dumps(f.__code__)
    return pool.map(partial(run_func, marshaled=marshaled), gen)


def run_func(*args, **kwargs):
    marshaled = kwargs.pop("marshaled")
    func = marshal.loads(marshaled)

    restored_f = types.FunctionType(func, globals())
    return restored_f(*args, **kwargs)


if __name__ == "__main__":
    main()

The idea is that the function's code object has everything you need in order to run it in a new process. Notice that no external dependencies are needed, just the standard Python library.
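The core of the trick in isolation — a code object survives a marshal round trip and can be rewrapped into a callable (a sketch; note that marshal's byte format is Python-version-specific, so the parent and worker must run the same interpreter):

```python
import marshal
import types


def f(x):
    return x * x


# marshal can serialize code objects, which pickle cannot.
payload = marshal.dumps(f.__code__)

# "In the worker": rebuild a callable from the raw bytes.
restored = types.FunctionType(marshal.loads(payload), globals())
print(restored(7))  # 49
```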

If closures are indeed needed, then the most difficult part of this solution is actually creating them (a closure is built from "cell" objects, which are not very easy to create in code...).
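As an aside, on Python 3.8+ cells can be constructed directly with `types.CellType`, which avoids generating a cell factory with `exec`. A sketch of that simpler route (`make_closure_func` is an illustrative stand-in for a locally defined closure):

```python
import marshal
import pickle
import types


def make_closure_func():
    a = 10

    def inner(c):
        return a + c

    return inner


f = make_closure_func()

# Ship the code object and the closure *values* separately.
marshaled = marshal.dumps(f.__code__)
pickled_values = pickle.dumps(tuple(cell.cell_contents
                                    for cell in f.__closure__))

# "In the worker": types.CellType (Python 3.8+) builds cells directly;
# the closure tuple must match the code object's co_freevars in length.
code = marshal.loads(marshaled)
cells = tuple(types.CellType(v) for v in pickle.loads(pickled_values))
restored = types.FunctionType(code, globals(), closure=cells)
print(restored(5))  # 15
```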

Here is the somewhat elaborate working code:

import marshal
import multiprocessing
import pickle
import types
from functools import partial


class A:
    def __init__(self, a):
        self.a = a


def main():
    x = A(1)

    def internal_func(c):
        return x.a + c

    with multiprocessing.Pool(5) as pool:
        print(internal_func_map(pool, internal_func, [i for i in range(10)]))


def internal_func_map(pool, f, gen):
    closure = f.__closure__
    marshaled_func = marshal.dumps(f.__code__)
    pickled_closure = pickle.dumps(tuple(x.cell_contents for x in closure))
    return pool.map(partial(run_func, marshaled_func=marshaled_func, pickled_closure=pickled_closure), gen)


def run_func(*args, **kwargs):
    marshaled_func = kwargs.pop("marshaled_func")
    func = marshal.loads(marshaled_func)
    pickled_closure = kwargs.pop("pickled_closure")
    closure = pickle.loads(pickled_closure)

    restored_f = types.FunctionType(func, globals(), closure=create_closure(func, closure))
    return restored_f(*args, **kwargs)


def create_closure(func, original_closure):
    # Generate a factory whose inner function references every free
    # variable of `func`, so Python allocates matching closure cells.
    indent = " " * 4
    closure_vars_def = f"\n{indent}".join(f"{name} = None" for name in func.co_freevars)
    closure_vars_ref = ",".join(func.co_freevars)
    dynamic_closure = "create_dynamic_closure"
    s = (f"""
def {dynamic_closure}():
    {closure_vars_def}
    def internal():
        {closure_vars_ref}
    return internal.__closure__
""")
    # Execute in an explicit namespace: relying on exec() mutating
    # locals() inside a function is fragile across Python versions.
    namespace = {}
    exec(s, namespace)
    created_closure = namespace[dynamic_closure]()
    # Fill the freshly created cells with the original values
    # (cell_contents is writable since Python 3.7).
    for closure_var, value in zip(created_closure, original_closure):
        closure_var.cell_contents = value
    return created_closure


if __name__ == "__main__":
    main()

Hope that helps, or at least gives you some ideas on how to tackle this problem!

Original Answer

Disclaimer: This answer applies if you want to define functions locally for better code organization, but are okay with them having global scope.

You can use the global keyword before defining the function. This solves the issue of pickling the function (because it is now a global function), while still defining it in a local scope.

import multiprocessing as mp

def main():
    global f
    def f(x):
        return x * x

    with mp.Pool(5) as p:
        print(p.map(f, [i for i in range(10)]))

if __name__ == "__main__":
    main()
    print(f(4))  # Inner function is available here as well.

Output:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
16

Adding another example with multiple functions of the same name; each subsequent definition overrides the previous one.

import multiprocessing as mp

def main():
    global f
    def f(x):
        return x * x

    with mp.Pool(5) as p:
        print(p.map(f, [i for i in range(10)]))

def main2():
    global f
    def f(x):
        return x * x * x

    with mp.Pool(5) as p:
        print(p.map(f, [i for i in range(10)]))

if __name__ == "__main__":
    main()
    main2()
    print(f(4))

Output:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
64

Updated Answer

Revoke the global status after map is called. Thanks to @KCQs for the hint in the comments.

To make sure global functions don't cause any issues for the rest of the code, you may simply add a del statement for the global function to revoke its global status.

import multiprocessing as mp

def main():
    global f
    def f(x):
        return x * x

    with mp.Pool(5) as p:
        print(p.map(f, [i for i in range(10)]))
    del f

if __name__ == "__main__":
    main()
    print(f(4))  # Inner function is no longer available.

Output:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Traceback (most recent call last):
  File "<file>.py", line 25, in <module>
    print(f(4))
NameError: name 'f' is not defined

Although Python garbage-collects automatically, you may also invoke the garbage collector manually.
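For completeness, a sketch combining del with an explicit collection pass (rarely necessary in CPython, where the del alone already drops the last reference):

```python
import gc
import multiprocessing as mp


def main():
    global f

    def f(x):
        return x * x

    with mp.Pool(5) as p:
        results = p.map(f, range(10))
    del f          # revoke the global binding
    gc.collect()   # optionally force an immediate collection pass
    return results


if __name__ == "__main__":
    print(main())
```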
