No performance gain after using multiprocessing for a queue-oriented function

The real code I want to optimize is too complicated to be included here, so here is a simplified example:

def enumerate_paths(n, k):
    """
    John wants to go up a flight of stairs that has N steps. He can take
    up to K steps each time. This function enumerates all the different
    ways he can go up this flight of stairs.
    """
    paths = []
    to_analyze = [(0,)]

    while to_analyze:
        path = to_analyze.pop()
        last_step = path[-1]

        if last_step >= n:
            # John has reached the top
            paths.append(path)
            continue

        for i in range(1, k + 1):
            # possible paths from this point
            extended_path = path + (last_step + i,)
            to_analyze.append(extended_path)

    return paths

and the output looks like this:

>>> enumerate_paths(3, 2)
[(0, 2, 4), (0, 2, 3), (0, 1, 3), (0, 1, 2, 4), (0, 1, 2, 3)]

You may find the result confusing, so here is an explanation. For example, (0, 1, 2, 4) means John places his foot on the first, second and fourth step, in that order, and finally he stops at step 4 because he only needs to go up 3 steps.

I tried to incorporate multiprocessing into this snippet, but observed no performance gain, not even a little!

import multiprocessing

def enumerate_paths_worker(n, k, queue):
    paths = []

    while not queue.empty():
        path = queue.get()
        last_step = path[-1]

        if last_step >= n:
            # John has reached the top
            paths.append(path)
            continue

        for i in range(1, k + 1):
            # possible paths from this point
            extended_path = path + (last_step + i,)
            queue.put(extended_path)

    return paths


def enumerate_paths(n, k):
    pool = multiprocessing.Pool()
    manager = multiprocessing.Manager()
    queue = manager.Queue()

    path_init = (0,)
    queue.put(path_init)
    apply_result = pool.apply_async(enumerate_paths_worker, (n, k, queue))

    return apply_result.get()

The Python list to_analyze acts just like a task queue, and each item in this queue can be processed separately, so I think this function has the potential to be optimized by employing multi-threading/processing. Also, please note that the order of items doesn't matter. In fact, when optimizing it, you may return a Python set, a Numpy array, or a Pandas data frame, as long as they represent the same set of paths.
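
For instance, an order-insensitive sanity check against the output shown above (a hypothetical test, not part of the original question) could compare results as sets:

expected = {(0, 2, 4), (0, 2, 3), (0, 1, 3), (0, 1, 2, 4), (0, 1, 2, 3)}
# any optimized implementation only needs to produce the same set of paths
assert set(enumerate_paths(3, 2)) == expected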

Bonus Question: How much performance can I gain by using scientific packages like Numpy, Pandas or Scipy for a task like this?

TL;DR

If your real algorithm doesn't involve costlier calculations than you showed us in your example, the communication overhead for multiprocessing will dominate and make your computation take many times longer than sequential execution.


Your attempt with apply_async actually just uses one worker of your pool; that's why you don't see a difference. apply_async is just feeding one worker at once by design. Further, it's not enough to just pass the serial version into the pool if your workers need to share intermediate results, so you will have to modify your target function to enable that.
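
As a minimal sketch (not part of the original answer), the following shows that a single apply_async call schedules exactly one task, and hence occupies only one worker, while submitting many tasks, for example via pool.map, spreads the work across the pool:

import multiprocessing
import os

def report(_):
    # returns the PID of whichever pool worker executed this task
    return os.getpid()

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        single = pool.apply_async(report, (None,))
        print(single.get())                       # one task -> exactly one worker PID
        print(set(pool.map(report, range(16))))   # many tasks -> typically several worker PIDs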

But as already said in the introduction, your computation will only benefit from multiprocessing if it's heavy enough to earn back the overhead of inter-process communication (and process creation).
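
To make the overhead concrete, here is a rough micro-benchmark sketch of my own (not from the answer): every put and get on a manager queue is a round trip to the manager process, while appending to a plain list stays in-process. Exact numbers will vary by machine:

import multiprocessing
import time

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    q = manager.Queue()
    n_items = 10_000

    start = time.perf_counter()
    for i in range(n_items):
        q.put((0, i))          # each put/get is inter-process communication
    for _ in range(n_items):
        q.get()
    print(f'manager queue: {time.perf_counter() - start:.3f} s')

    start = time.perf_counter()
    lst = []
    for i in range(n_items):
        lst.append((0, i))     # in-process, no serialization involved
    for _ in range(n_items):
        lst.pop()
    print(f'plain list:    {time.perf_counter() - start:.6f} s')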

My solution below for the general problem uses JoinableQueue in combination with a sentinel value for process termination, to synchronize the workflow. I'm adding a function busy_foo to make the computation heavier, to show a case where multiprocessing has its benefits.

from multiprocessing import Process
from multiprocessing import JoinableQueue as Queue
import time

SENTINEL = 'SENTINEL'

def busy_foo(x = 10e6):
    for _ in range(int(x)):
        x -= 1


def enumerate_paths(q_analyze, q_result, n, k):
    """
    John wants to go up a flight of stairs that has N steps. He can take
    up to K steps each time. This function enumerates all the different
    ways he can go up this flight of stairs.
    """
    for path in iter(q_analyze.get, SENTINEL):
        last_step = path[-1]

        if last_step >= n:
            busy_foo()
            # John has reached the top
            q_result.put(path)
            q_analyze.task_done()
            continue
        else:
            busy_foo()
            for i in range(1, k + 1):
                # possible paths from this point
                extended_path = path + (last_step + i,)
                q_analyze.put(extended_path)
            q_analyze.task_done()


if __name__ == '__main__':

    N_CORES = 4

    N = 6
    K = 2

    start = time.perf_counter()
    q_analyze = Queue()
    q_result = Queue()

    q_analyze.put((0,))

    pool = []
    for _ in range(N_CORES):
        pool.append(
            Process(target=enumerate_paths, args=(q_analyze, q_result, N, K))
        )

    for p in pool:
        p.start()

    q_analyze.join() # block until everything is processed

    for p in pool:
        q_analyze.put(SENTINEL)  # let the processes exit gracefully

    results = []
    while not q_result.empty():
        results.append(q_result.get())

    for p in pool:
        p.join()

    print(f'elapsed: {time.perf_counter() - start: .2f} s')

Results

If I'm using the code above with busy_foo commented out, it takes for N=30, K=2 (2178309 results):

  • ~208 s with N_CORES=4
  • 2.78 s with the sequential original

Pickling and unpickling, threads running against locks etc., account for this huge difference.

Now with busy_foo enabled for both versions and N=6, K=2 (21 results) it takes:

  • 6.45 s with N_CORES=4
  • 30.46 s with the sequential original

Here the computation was heavy enough to allow the overhead to be earned back.

Numpy

Numpy can speed up vectorized operations many times over, but you would likely see performance penalties with numpy on this one. Numpy uses contiguous blocks of memory for its arrays. When you change the array size, the whole array has to be rebuilt, unlike with Python lists.
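
As a small illustration (my own sketch, not from the answer), growing a NumPy array element by element with np.append allocates and copies the whole array on every call, whereas list.append is amortized O(1); exact timings vary by machine:

import time
import numpy as np

n = 20_000

start = time.perf_counter()
lst = []
for i in range(n):
    lst.append(i)              # amortized O(1) per append
print(f'list.append: {time.perf_counter() - start:.4f} s')

start = time.perf_counter()
arr = np.empty(0, dtype=np.int64)
for i in range(n):
    arr = np.append(arr, i)    # allocates and copies the whole array each time
print(f'np.append:   {time.perf_counter() - start:.4f} s')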
