
Python multiprocessing problem with dictionary as argument

I have a function that parallelizes another function via a multiprocessing pool; the parallelized function takes a dictionary as input. I would expect the code below to simply print the numbers from 0 to 32. However, the result shows that many numbers are printed more than once.

Does anybody have an idea why?

import multiprocessing as mp
import numpy as np
import functools

def test(name, t_dict):
    t_dict['a'] = name
    return t_dict

def mp_func(func, iterator, **kwargs):
    f_args = functools.partial(func, **kwargs)
    pool = mp.Pool(mp.cpu_count())
    res = pool.map(f_args, iterator)
    pool.close()
    return res


mod = dict()

m = 33
res = mp_func(func=test, iterator=np.arange(m), t_dict=mod)
for di in res:
    print(di['a'])

The problem is that t_dict is passed as part of the partial function f_args. Partial functions are instances of <class 'functools.partial'>. When you create the partial, it gets a reference to test and the empty dictionary in mod. Every time you call f_args, that one dictionary on the partial object is modified. This is easier to spot with a list in a single process.

>>> def foo(name, t_list):
...     t_list.append(name)
...     return t_list
... 
>>> mod = []
>>> f = functools.partial(foo, t_list=mod)
>>> f(0)
[0]
>>> f(1)
[0, 1]
>>> f(2)
[0, 1, 2]
>>> mod
[0, 1, 2]
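One extra check in the same session makes the aliasing explicit: every call returns the very same list object, which is also mod itself.

>>> f(0) is f(1)
True
>>> f(2) is mod
True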

When you call pool.map(f_args, iterator), f_args is pickled and sent to each subprocess to be the worker. So each subprocess has its own copy of the dictionary, and that copy is updated for every iterated value the subprocess happens to get.

For efficiency, multiprocessing will chunk data. That is, each subprocess is handed a list of iterated values that it will process into a list of responses to return as a group. But since every response in a chunk references the same single dict, when the chunk is returned to the parent all of the responses hold only the final value that was set. If 0, 1, 2 were processed, the return is 2, 2, 2.
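To see the chunk effect directly, here is a small self-contained sketch. The chunk size of 4 and the pool size of 2 are just illustrative choices to make the grouping obvious; the exact values you get depend on how the chunks are scheduled.

import multiprocessing as mp
import functools

def test(name, t_dict):
    t_dict['a'] = name
    return t_dict

if __name__ == '__main__':
    f_args = functools.partial(test, t_dict={})
    with mp.Pool(2) as pool:
        # force a chunk size so the grouping is easy to see
        res = pool.map(f_args, range(8), chunksize=4)
    print([d['a'] for d in res])      # e.g. [3, 3, 3, 3, 7, 7, 7, 7]
    print(len({id(d) for d in res}))  # one shared dict object per chunk, e.g. 2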

The solution will depend on your data. It's expensive to pass data back and forth between the pool processes and the parent, so ideally the data is generated completely in the worker. In this case, ditch the partial and have the worker create the dict.

It's likely your situation is more complicated than this.

import multiprocessing as mp
import numpy as np

def test(name):
    return {'a': name}

def mp_func(func, iterator):
    pool = mp.Pool(mp.cpu_count())
    res = pool.map(func, iterator)
    pool.close()
    return res

m = 33
res = mp_func(func=test, iterator=np.arange(m))
for di in res:
    print(di['a'])
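If you really do need to pass extra keyword arguments (for example, a dict of defaults), a possible alternative is to keep the partial but copy the dict inside the worker, so every call builds and returns its own object. A minimal sketch, assuming the original mp_func signature:

import multiprocessing as mp
import numpy as np
import functools

def test(name, t_dict):
    out = dict(t_dict)  # shallow copy: each call gets its own dict
    out['a'] = name
    return out

def mp_func(func, iterator, **kwargs):
    f_args = functools.partial(func, **kwargs)
    with mp.Pool(mp.cpu_count()) as pool:
        return pool.map(f_args, iterator)

m = 33
res = mp_func(func=test, iterator=np.arange(m), t_dict={'b': 0})
for di in res:
    print(di['a'])  # 0 .. 32, each in its own separate dict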

As everyone is telling you, it is in general a bad idea to have multiple threads/processes all modifying the same location and then expecting that location to hold the value that your particular thread gave it.

Your code will run better if all mutation of the shared data structure happens in only one place. So the general plan is:

def worker(key):
    ... calculate value produced by key ...
    return key, value

def runner():
    with mp.Pool() as pool:
        for key, value in pool.imap_unordered(worker, np.arange(m), chunksize=...):
            ... do fast mutation here ...
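
A minimal runnable version of that plan, assuming the goal is still to produce one small dict per key (the chunksize of 4 is only a placeholder):

import multiprocessing as mp
import numpy as np

m = 33

def worker(key):
    # the heavy per-key work happens here, in the subprocess
    value = {'a': key}
    return key, value

def runner():
    results = {}
    with mp.Pool() as pool:
        for key, value in pool.imap_unordered(worker, np.arange(m), chunksize=4):
            # the only mutation of shared state happens here, in the parent
            results[key] = value
    return results

if __name__ == '__main__':
    res = runner()
    for key in sorted(res):
        print(res[key]['a'])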
