
Why is numba with parallel=True slower with this parallelizable for loop?

I am currently having difficulty understanding why the following code gets slower after enabling Numba parallelization.

Here is the base code without specifying parallelization:

import numpy as np
from numba import njit, prange

@njit('f8[:,::1](f8[:,::1], f8[:,::1], f8[:,::1])', fastmath=True)
def fun(A, B, C):

    n = A.shape[1]

    b00 = B[0,0]
    b02 = B[0,2]

    out = np.empty((n, 12))

    for i in range(n):

        ui = A[0,i]

        c1 = C[0,i]
        c2 = C[1,i]
        c3 = C[2,i]
        c4 = C[3,i]

        out[i, 0] = c1 * b00
        out[i, 1] = 0.
        out[i, 2] = c1 * (b02-ui)
        out[i, 3] = c2 * b00
        out[i, 4] = 0.
        out[i, 5] = c2 * (b02-ui)
        out[i, 6] = c3 * b00
        out[i, 7] = 0.
        out[i, 8] = c3 * (b02-ui)
        out[i, 9] = c4 * b00
        out[i, 10] = 0.
        out[i, 11] = c4 * (b02-ui)

    return out

and here is the parallelized version:

@njit('f8[:,::1](f8[:,::1], f8[:,::1], f8[:,::1])', fastmath=True, parallel=True)
def fun_parallel(A, B, C):

    n = A.shape[1]

    b00 = B[0,0]
    b02 = B[0,2]

    out = np.empty((n, 12))

    for i in prange(n):

        ui = A[0,i]

        c1 = C[0,i]
        c2 = C[1,i]
        c3 = C[2,i]
        c4 = C[3,i]

        out[i, 0] = c1 * b00
        out[i, 1] = 0.
        out[i, 2] = c1 * (b02-ui)
        out[i, 3] = c2 * b00
        out[i, 4] = 0.
        out[i, 5] = c2 * (b02-ui)
        out[i, 6] = c3 * b00
        out[i, 7] = 0.
        out[i, 8] = c3 * (b02-ui)
        out[i, 9] = c4 * b00
        out[i, 10] = 0.
        out[i, 11] = c4 * (b02-ui)

    return out

Measuring execution times with perfplot and the following code:

import perfplot
import numpy as np

B = np.random.rand(3,3)

perfplot.show(
    setup=lambda n: (np.random.rand(2, n), np.random.rand(4, n)),  # or setup=np.random.rand
    kernels=[
        lambda A, C: fun(A, B, C),
        lambda A, C: fun_parallel(A, B, C),
    ],
    labels=["fun", "fun_parallel"],
    n_range=[2**k for k in range(15)],
    xlabel="n",
    show_progress=False,
)

gives the following performance for varying array sizes.

[Plot: execution time of fun vs. fun_parallel as a function of n]

showing a noticeable increase in execution time with the parallelized version.

Any help understanding why this happens is much appreciated.

TL;DR: parallelism is only worth it for relatively long compute-bound operations, clearly not for short memory-bound ones. Please consider merging algorithms so as to make the code more compute-bound.


When n is small, the overhead of creating threads or distributing the work (depending on the target parallel backend) is much bigger than the computation itself, as pointed out by Ali_Sh in the comments (see this past answer). Indeed, this is consistent with a constant overhead for small n values below 5e3 (though the timing is unstable, certainly because of OS syscalls, NUMA effects and non-deterministic synchronizations). The bigger the number of cores, the bigger the overhead. Thus, parallelism can be beneficial on some machines (like my PC with 6 cores when n > 1e3). Reducing the number of threads should help a bit.

When n is big, using multiple threads does not provide a big speed-up because the operation is memory-bound. Indeed, the RAM is shared between cores, and a few cores are generally enough to saturate it (if not 1 core on some PCs). On my PC, designed for reaching a high RAM throughput, 2 cores of my i5-9600KF processor are able to saturate the RAM with a throughput close to 40 GiB/s. Note that writing data into a newly-created large array is not very efficient in Numba/NumPy. This is because the OS needs to do page faults (generally during the first touch, or during the np.empty on a few systems). One can pre-allocate the output array once so as to avoid page faults being done again for each function call (assuming you can reuse the array). In addition, the write-allocate cache policy causes data to be read from memory in order to write it, which wastes half the memory bandwidth. There is no way to make that faster in Numba yet (there is an open issue on this). C/C++ codes can speed this up using non-temporal store instructions on x86-64 platforms.

When n is sufficiently big for the parallel overhead to be relatively small, and the computation still fits in the CPU cache, using multiple threads can significantly help. That being said, this is a narrow window, and it is probably what happens on your machine for 5e3 < n < 1e4.

Memory has been slow for a few decades, and it tends to get even slower relative to CPUs over time. This effect, called the "memory wall", was conjectured several decades ago and has been confirmed to be true so far. Thus, memory-bound codes are not expected to scale well any time soon (quite the opposite in fact). The only way to overcome this problem is to avoid operating on big buffers: data should be computed on the fly, using the CPU caches as much as possible (especially the L1/L2 if possible). One should also prefer compute-bound algorithms over memory-bound ones (for a similar amount of work). Recomputing data rather than pre-computing large buffers can be faster if the amount of computation is small.
