How do I use Python multiprocessing with a generator?

Question

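I want to use multiprocessing in Python with a generator function.

Suppose I have a huge list of lists, big_list, and I want to use multiprocessing to compute values. If I use a "traditional" function that returns values, this is simple: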

import concurrent.futures

def compute_function(list_of_lists):
    return_values = []  # results for this group of lists
    for sublist in list_of_lists:
        new_value = compute_something(sublist)  # compute something; just an example
        return_values.append(new_value)
    return return_values

with concurrent.futures.ProcessPoolExecutor(max_workers=N) as executor:
    new_list = list(executor.map(compute_function, big_list))

However, using lists this way takes too much memory. So I would like to use a generator function instead:

import concurrent.futures

def generator_function(list_of_lists):
    for sublist in list_of_lists:
        new_value = compute_something(sublist)  # compute something; just an example
        yield new_value

with concurrent.futures.ProcessPoolExecutor(max_workers=N) as executor:
    # Fails: each worker returns a generator object, which cannot be pickled
    new_list = list(executor.map(generator_function, big_list))

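My problem is that you can't pickle generators. There are workarounds for this problem with other data structures, but I don't think they apply to generators.

How can I do this?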

Answer 1

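A generator is just a fancy loop that preserves state. It works like an iterator, which gives you next, hasNext and similar APIs (in Python, next() plus a StopIteration exception once the items run out), so your loop asks the iterator for the next item for as long as it has one.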

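How a generator is implemented is entirely up to the developer. It can be done by:

Loading all the data into memory and walking through it with next, which is not memory-efficient, e.g. for i in [1,2,3,4]
Reading a file line by line, e.g. for line in file
Deriving the next element from the last one produced when the generating function is known, e.g. range(100)
And more...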

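All of these share one requirement: the generator has to keep its current state so that it knows what to yield next. That makes it very stateful, which in turn makes it a very poor choice for use with multiprocessing.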
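To make the statefulness point concrete, here is a minimal sketch showing that a live generator cannot be pickled, which is the step multiprocessing needs in order to move an object between processes. The generator count_up is a made-up example, not code from the question.

```python
import pickle


def count_up(limit):
    # A tiny generator; its position in the loop is hidden interpreter state.
    for i in range(limit):
        yield i


gen = count_up(3)
next(gen)  # advance once so the generator holds live state

try:
    pickle.dumps(gen)
    pickled_ok = True
except TypeError as exc:
    # CPython refuses to serialize generator objects.
    pickled_ok = False
    print(f"pickling failed: {exc}")
```

The same error is what surfaces when a worker function returns a generator through ProcessPoolExecutor.map.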

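You can work around this with map-reduce-like logic: split the whole list into small sublists, pass them to the workers, and concatenate all their outputs into the final result.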
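One way the split-and-concatenate idea might look is sketched below. The helper names chunked, compute_chunk and iter_results, the default chunk size, and the stand-in body of compute_something are all assumptions made for illustration; only ProcessPoolExecutor.map and itertools.islice are real library calls.

```python
import concurrent.futures
import itertools


def compute_something(item):
    # Hypothetical stand-in for the real per-item computation.
    return sum(item)


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


def compute_chunk(chunk):
    # Runs in a worker process and returns a plain list, which pickles fine.
    return [compute_something(item) for item in chunk]


def iter_results(items, chunk_size=1000, max_workers=None):
    """Yield one result per item, in input order, computing chunks in parallel."""
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        for chunk_result in executor.map(compute_chunk, chunked(items, chunk_size)):
            # "Reduce" step: flatten each chunk's output back into one stream.
            yield from chunk_result
```

The caller gets a generator back, so results can be consumed one at a time, while only plain lists cross the process boundary. Note that executor.map submits every chunk up front, so the input is still read in full before results start arriving. On platforms that start workers with spawn, call iter_results from under an if __name__ == "__main__": guard.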

Answer 2

You can use itertools.chain.from_iterable to enumerate big_list one level deeper, so that you iterate over the sublists.

import concurrent.futures
import itertools

def compute_function(item):
    return compute_something(item)

with concurrent.futures.ProcessPoolExecutor(max_workers=N) as executor:
    # chain.from_iterable flattens big_list by one level, so each task is one sublist
    for result in executor.map(compute_function,
            itertools.chain.from_iterable(big_list)):
        print(result)
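For readers unfamiliar with chain.from_iterable, this small sketch shows the one-level flattening that the code above relies on. The contents of big_list here are made-up sample data.

```python
import itertools

# Made-up sample: two groups, each holding a few inner lists.
big_list = [[[1, 2], [3, 4]], [[5, 6]]]

# Lazily walks the groups and yields each inner list in turn.
flat = itertools.chain.from_iterable(big_list)

first = next(flat)  # the first inner list
rest = list(flat)   # the remaining inner lists, in order
```

Each inner list then becomes one task for the pool. If the per-item work is small, the chunksize argument of ProcessPoolExecutor.map can batch several items into a single task to cut inter-process overhead.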
