I am trying to leverage `Pool.map(func, itr)` to increase the performance of a program, and I need `func` to access a very large dictionary called `cache` so it can do a cache lookup. The `cache` stores the binary representations of each of the first 2**16 integers:

```python
cache = {i: bin(i) for i in range(2**16)}
```
The responsibility of `func` is to count up the number of 1s, or on-bits, in the binary representation of the int passed to it:

```python
def func(i: int) -> int:
    return cache[i].count("1")
```

I want to do something like the following:

```python
from multiprocessing import Pool

with Pool(8) as pool:
    counts = pool.map(func, range(2**16 - 1))
```
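Run serially, a quick sanity check (assuming the `cache` definition above) confirms that `func` is just a popcount:

```python
cache = {i: bin(i) for i in range(2**16)}

def func(i: int) -> int:
    # bin(3) == "0b11", so func(3) counts two on-bits
    return cache[i].count("1")

print(func(0), func(3), func(255))  # → 0 2 8
```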
But how do I make the `cache` object available to `func` in each worker subprocess?

One can "out-clever" oneself with the following recipe, found across the internet:
```python
import functools
from multiprocessing import Pool
from typing import Dict

cache = {i: bin(i) for i in range(2**16)}

def func(i: int, cache: Dict[int, str]) -> int:
    return cache[i].count("1")

with Pool(8) as pool:
    # Bind 'cache' to 'func' and pass the partial to map()
    counts = pool.map(functools.partial(func, cache=cache),
                      range(2**16 - 1))
This works...until you realize that it is actually slower than running without parallelization! You end up spending more on serializing/deserializing your big `cache` than you gain back from parallelizing. See Stuck in a Pickle for a more in-depth explanation.
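To get a feel for that overhead (a rough sketch, not from the original post): the `partial` object binds the whole dict, and pickling it shows how many bytes ride along whenever the pool ships work to a subprocess:

```python
import functools
import pickle

cache = {i: bin(i) for i in range(2**16)}

def func(i, cache):
    return cache[i].count("1")

# The partial carries the entire cache; pickling it (as Pool must do to
# send tasks to a worker subprocess) serializes all ~65k entries.
payload = pickle.dumps(functools.partial(func, cache=cache))
print(f"{len(payload):,} bytes per pickled task payload")
```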
The current "best practice" for copying data to a `Pool` worker subprocess is, in one way or another, to make the variable global. The pattern looks as follows:
```python
from multiprocessing import Pool
from typing import Dict

cache = {i: bin(i) for i in range(2**16)}

def func(i: int) -> int:
    return global_cache[i].count("1")

def make_global(cache: Dict[int, str]) -> None:
    # Declare 'global_cache' to be global
    global global_cache
    # Assign 'global_cache' a value, now *implicitly* accessible in func()
    global_cache = cache

with Pool(8, initializer=make_global, initargs=(cache,)) as pool:
    counts = pool.map(func, range(2**16 - 1))
```
This same pattern can be applied to object-oriented code by swapping class attributes in for global variables. We buy a bit more encapsulation this way.
A note on the `global` keyword inside of `make_global()`'s function body:

The `global` keyword above declares a variable named `global_cache`. From the point that this is declared until the end of the program, `global_cache` will be accessible with global scope, despite being declared within a function's scope (though this won't get "globalized" until a subprocess is forked, isolating the global scope to the worker process).
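In other words (a toy example of the same mechanism, with names of my choosing): the name does not exist at module scope until the function body actually runs:

```python
def make_global_answer():
    global answer  # declares that 'answer' lives at module scope
    answer = 42

try:
    answer
except NameError:
    print("'answer' not defined yet")  # the declaration alone creates nothing

make_global_answer()
print(answer)  # → 42, now accessible with global scope
```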
There is a third option, though it lives in a CPython fork buried deep, deep in a GitHub repository. This fork proposes a feature that allows you to do the following:
```python
from multiprocessing import Pool
from typing import Dict

cache = {i: bin(i) for i in range(2**16)}

def func(i: int, initret: Dict[int, str]) -> int:
    cache = initret  # Re-assign var for illustrative/readability purposes
    return cache[i].count("1")

def identity(cache: Dict[int, str]) -> Dict[int, str]:
    return cache

with Pool(8, initializer=identity, initargs=(cache,)) as pool:
    counts = pool.map(func, range(2**16 - 1))
```
Though it is a small change, it skirts around using globals and allows for a more readable "flow of data" between parent and worker processes. More on this here.

Essentially, the return value of `initializer` (`identity()` above) is passed to `func` (as a kwarg named `initret`) each time `func` is called in the worker process.
Note: I am the author of all linked blog posts above.