
How can I use Python multiprocessing with generators?

I would like to use multiprocessing in Python with generator functions.

Let's say I have a massive list of lists, big_list, and I would like to use multiprocessing to compute values. If I use "traditional" functions that return values, this is straightforward:

import concurrent.futures

def compute_function(list_of_lists):
    return_values = []   ## empty list to collect results
    for sublist in list_of_lists:
        new_value = compute_something(sublist)    ## compute something; just an example
        return_values.append(new_value)  ## append to list
    return return_values

with concurrent.futures.ProcessPoolExecutor(max_workers=N) as executor:
    new_list = list(executor.map(compute_function, big_list))

However, using lists in this manner is too memory-intensive, so I would like to use generator functions instead:

import concurrent.futures

def generator_function(list_of_lists):
    for sublist in list_of_lists:
        new_value = compute_something(sublist)    ## compute something; just an example
        yield new_value

with concurrent.futures.ProcessPoolExecutor(max_workers=N) as executor:
    new_list = list(executor.map(generator_function, big_list))

My problem is that you cannot pickle generators. There are workarounds to this problem for other data structures, but I don't think there are any for generators.
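
For reference, a minimal snippet showing the failure (the exact error message may vary by Python version):

import pickle

def gen():
    yield 1

pickle.dumps(gen())   ## raises TypeError: cannot pickle 'generator' object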

How could I accomplish this?

A generator is just a fancy loop that preserves its state. It follows the same logic as an iterator: it provides a next / hasNext-style API (in Python, next() and the StopIteration exception), so your loop asks the iterator for the next item for as long as there is one.

The implementation of a generator is completely up to the developer. It can be implemented by:

  • loading all data into memory and traversing it with next, which achieves no memory efficiency at all, e.g. for i in [1, 2, 3, 4]
  • reading a file line by line, e.g. for line in file
  • if the generating function is known, computing the next element from the last generated one, e.g. as in range(100)
  • and much more...

All of these share a common requirement: the generator needs to keep its current state so that it knows what to yield next. That makes it very much stateful, which in turn makes it a very bad choice for multiprocessing, since that state cannot be pickled and shipped to worker processes.
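
For instance, a tiny countdown generator (hypothetical, purely for illustration) makes that statefulness visible:

def countdown(n):
    while n > 0:
        yield n      ## execution pauses here between next() calls
        n -= 1

g = countdown(3)
print(next(g))   ## 3
print(next(g))   ## 2 -- the generator resumed exactly where it paused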

You can approach this problem with map-reduce-like logic: split the whole list into small sublists, pass those to the workers, and join all of their outputs into the final result, as sketched below.
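
A minimal sketch of that idea, reusing compute_something, big_list, and N from the question; the chunk size of 1000 is an arbitrary illustrative value:

import concurrent.futures

def compute_chunk(chunk):
    ## the "map" step: each worker processes one small sublist of big_list
    return [compute_something(item) for item in chunk]

def chunks(seq, size):
    ## split seq into sublists of at most `size` items
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

with concurrent.futures.ProcessPoolExecutor(max_workers=N) as executor:
    results = []
    for partial in executor.map(compute_chunk, chunks(big_list, 1000)):
        results.extend(partial)   ## the "reduce" step: join the workers' outputs

Each chunk is a plain list, so unlike a generator it pickles without trouble.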

You can do your enumeration one level deeper in big_list, using itertools.chain.from_iterable to flatten the sublists into a single lazy stream of items.

import concurrent.futures
import itertools

def compute_function(item):
    return compute_something(item)   ## compute something per item; just an example

with concurrent.futures.ProcessPoolExecutor(max_workers=N) as executor:
    ## chain.from_iterable lazily flattens big_list, feeding items to the
    ## workers one at a time instead of materializing a flat copy first
    for result in executor.map(compute_function,
            itertools.chain.from_iterable(big_list)):
        print(result)
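
If per-item round trips become a bottleneck, executor.map also accepts a chunksize argument (honored by ProcessPoolExecutor since Python 3.5) that batches items per worker task; 1000 below is just an illustrative value. Inside the same with block as above:

    for result in executor.map(compute_function,
            itertools.chain.from_iterable(big_list),
            chunksize=1000):
        print(result)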
