
Python multiprocessing problem with dictionary as argument

I have a function that parallelizes another function via a multiprocessing pool; the parallelized function takes a dictionary as input. I would expect the code below to just print the numbers from 0 to 32. However, the output shows that many numbers are printed more than once.

Does anybody have an idea why?

import multiprocessing as mp
import numpy as np
import functools

def test(name, t_dict):
    t_dict['a'] = name
    return t_dict

def mp_func(func, iterator, **kwargs):
    f_args = functools.partial(func, **kwargs)
    pool = mp.Pool(mp.cpu_count())
    res = pool.map(f_args, iterator)
    pool.close()
    return res


mod = dict()

m = 33
res = mp_func(func=test, iterator=np.arange(m), t_dict=mod)
for di in res:
    print(di['a'])

The problem is that t_dict is passed as part of the partial function f_args. Partial functions are instances of <class 'functools.partial'>. When you create the partial, it gets a reference to test and to the empty dictionary in mod. Every time you call f_args, that one dictionary held by the partial object is modified. This is easier to spot with a list in a single process.

>>> def foo(name, t_list):
...     t_list.append(name)
...     return t_list
... 
>>> mod = []
>>> f = functools.partial(foo, t_list=mod)
>>> f(0)
[0]
>>> f(1)
[0, 1]
>>> f(2)
[0, 1, 2]
>>> mod
[0, 1, 2]

When you call pool.map(f_args, iterator), f_args is pickled and sent to each worker subprocess. So each subprocess has its own unique copy of the dictionary, and that one copy is updated for every iterated value the subprocess happens to get.
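
You can see that copying step in isolation by round-tripping the partial through pickle, which is roughly what the pool does when it hands the function to a worker (a minimal single-process sketch, not the pool's actual code):

import functools
import pickle

def test(name, t_dict):
    t_dict['a'] = name
    return t_dict

mod = dict()
f = functools.partial(test, t_dict=mod)
g = pickle.loads(pickle.dumps(f))  # roughly what a worker subprocess receives

g(0)
print(mod)                   # {} -- the parent's dictionary is untouched
print(g.keywords['t_dict'])  # {'a': 0} -- only the unpickled copy was modified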

For efficiency, multiprocessing chunks the data. That is, each subprocess is handed a list of iterated values, which it processes into a list of responses to return as a group. But since every response in a chunk references the same single dict, when the chunk is returned to the parent, all of the responses hold only the final value that was set. If 0, 1, 2 were processed, the return is 2, 2, 2.
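
The effect inside one worker's chunk can be simulated in a single process (a hypothetical sketch of what the worker's loop does, not the pool's actual code): every call mutates and returns the same dict object, so the collected results all show the last value.

import functools

def test(name, t_dict):
    t_dict['a'] = name
    return t_dict

f_args = functools.partial(test, t_dict=dict())
chunk_results = [f_args(i) for i in (0, 1, 2)]  # one worker processing its chunk
print(chunk_results)                         # [{'a': 2}, {'a': 2}, {'a': 2}]
print(chunk_results[0] is chunk_results[1])  # True -- all entries are the same object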

The solution will depend on your data. It's expensive to pass data back and forth between the pool processes and the parent, so ideally the data is generated entirely in the worker. In that case, ditch the partial and have the worker create the dict.

It's likely your situation is more complicated than this.

import multiprocessing as mp
import numpy as np

def test(name):
    return {'a': name}

def mp_func(func, iterator, **kwargs):
    pool = mp.Pool(mp.cpu_count())
    res = pool.map(func, iterator)
    pool.close()
    return res

m = 33
res = mp_func(func=test, iterator=np.arange(m))
for di in res:
    print(di['a'])

As everyone is telling you, it is generally a bad idea to have multiple threads/processes all modifying the same location, and then expecting that location to hold the value your thread gave it.

Your code will run better if all mutation of the shared data structure happens in only one place. So the general plan is:

def worker(key):
    ... calculate value produced by key ...
    return key, value

def runner():
    with mp.Pool() as pool:
        for key, value in pool.imap_unordered(worker, np.arange(m), chunksize=...):
            ... do fast mutation here ...
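
Filled in with a placeholder computation, the plan looks like this (the squaring in worker is purely hypothetical, standing in for whatever per-key work you actually do):

import multiprocessing as mp
import numpy as np

m = 33

def worker(key):
    # placeholder for "calculate value produced by key"
    return key, int(key) ** 2

def runner():
    results = {}  # the shared structure, mutated only in the parent
    with mp.Pool() as pool:
        for key, value in pool.imap_unordered(worker, np.arange(m)):
            results[int(key)] = value
    return results

if __name__ == '__main__':
    print(runner())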
