Pass data to Python multiprocessing.Pool worker processes

I am trying to leverage Pool.map(func, itr) to increase the performance of a program, and I need func to access a very large dictionary called cache so it can do a cache lookup.

The cache stores "the binary representations of each of the first 2**16 integers".

cache = {i: bin(i) for i in range(2**16)}

The responsibility of func is to count up the number of 1s, or on-bits, in the binary representation of the int passed to it:

def func(i: int) -> int:
    return cache[i].count("1")

I want to do something like the following:

from multiprocessing import Pool

with Pool(8) as pool:
    counts = pool.map(func, range(2**16))

But how do I make the cache object available to func in each worker subprocess?

A Naive Solution

One can "out-clever" oneself with the following recipe, found all across the internet:

import functools
from multiprocessing import Pool
from typing import Dict

cache = {i: bin(i) for i in range(2**16)}

def func(i: int, cache: Dict[int, str]) -> int:
    return cache[i].count("1")


with Pool(8) as pool:
    # Bind 'cache' to 'func' and pass the partial to map()
    counts = pool.map(functools.partial(func, cache=cache),
                      range(2**16))

This works...until you realize that it is actually slower than running without parallelization! You end up spending more on serializing and deserializing your big cache than you get back from parallelizing. See Stuck in a Pickle for a more in-depth explanation.
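To get a rough feel for why, you can measure what has to cross the process boundary: the partial drags the whole cache along with it as a bound argument, and Pool must pickle it to ship work to the workers. A quick measurement sketch (exact byte counts will vary by Python version):

import functools
import pickle

cache = {i: bin(i) for i in range(2**16)}

def func(i, cache):
    return cache[i].count("1")

# The partial carries the entire cache as a bound argument, so
# pickling it shows roughly how much data gets sent to workers.
payload = pickle.dumps(functools.partial(func, cache=cache))
print(f"pickled partial: {len(payload):,} bytes")  # on the order of a megabyte or more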

A Correct Solution

The current "best practice" for copying data to a Pool worker subprocess is, in one way or another, to make the variable global. The pattern looks as follows:

from multiprocessing import Pool
from typing import Dict

cache = {i: bin(i) for i in range(2**16)}

def func(i: int) -> int:
    return global_cache[i].count("1")


def make_global(cache: Dict[int, str]) -> None:
    # Declare 'global_cache' to be global
    global global_cache
    # Update 'global_cache' with a value, now *implicitly* accessible in func
    global_cache = cache


with Pool(8, initializer=make_global, initargs=(cache,)) as pool:
    counts = pool.map(func, range(2**16))

This same pattern can be applied to object-oriented code by swapping class attributes in for global variables, as sketched below. We buy a bit more encapsulation this way.
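A minimal sketch of that object-oriented variant, assuming Python 3.8+ (the Worker class and its method names are illustrative, not taken from the linked posts):

from multiprocessing import Pool
from typing import Dict


class Worker:
    # Populated in each worker process by init(); plays the same
    # role as 'global_cache' above, but namespaced under the class.
    cache: Dict[int, str] = {}

    @classmethod
    def init(cls, cache: Dict[int, str]) -> None:
        cls.cache = cache

    @staticmethod
    def func(i: int) -> int:
        return Worker.cache[i].count("1")


cache = {i: bin(i) for i in range(2**16)}

with Pool(8, initializer=Worker.init, initargs=(cache,)) as pool:
    counts = pool.map(Worker.func, range(2**16))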

A note on the global keyword inside of make_global()'s function body:

The global statement above does not create a variable by itself; it declares that the name global_cache refers to module-level (global) scope, even though it appears inside a function. Once make_global() has run, global_cache exists at global scope for the rest of the program. Because Pool runs make_global() as the initializer inside each worker subprocess, every worker ends up with its own copy of global_cache, isolated to that worker process.
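A tiny, multiprocessing-free demonstration of that scoping behavior:

def make_global(value):
    # 'global' marks the name as module-level; the assignment below
    # creates it at module scope, not in the function's local scope.
    global global_value
    global_value = value


make_global(42)
print(global_value)  # -> 42, visible outside the function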

A (Proposed) New Solution

There is a 3rd option, though it lives in a CPython fork buried deep, deep in a GitHub repository.

This fork proposes a feature that allows you to do the following:

from multiprocessing import Pool
from typing import Dict

cache = {i: bin(i) for i in range(2**16)}

def func(i: int, initret: Dict[int, str]) -> int:
    cache = initret  # Re-assign var for illustrative/readability purposes
    return cache[i].count("1")


def identity(cache: Dict[int, str]) -> Dict[int, str]:
    return cache


with Pool(8, initializer=identity, initargs=(cache,)) as pool:
    counts = pool.map(func, range(2**16))

Though it is a small change, it skirts around using globals and allows for a more readable "flow of data" between parent and worker processes. More on this here.

Essentially, the return value of initializer (identity() above) is passed to func as a kwarg named initret each time func is called in the worker process.
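For what it's worth, a similar flow of data can be approximated on stock CPython by funneling the initializer's return value through a single module-level slot. The sketch below is illustrative only (the names _run_initializer and _initret are made up for it), and it still relies on a global under the hood; it just confines that global to one small helper:

from multiprocessing import Pool
from typing import Dict

_initret = None  # per-worker-process slot, set once by _run_initializer()

def _run_initializer(initializer, *initargs):
    # Stash the initializer's return value where tasks can find it.
    global _initret
    _initret = initializer(*initargs)

def func(i: int) -> int:
    # '_initret' stands in for the proposed 'initret' kwarg.
    return _initret[i].count("1")

def identity(cache: Dict[int, str]) -> Dict[int, str]:
    return cache

cache = {i: bin(i) for i in range(2**16)}

with Pool(8, initializer=_run_initializer,
          initargs=(identity, cache)) as pool:
    counts = pool.map(func, range(2**16))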

Note: I am the author of all linked blog posts above.
