Why my code runs so much slower with joblib.Parallel() than without?
I have just started to use joblib.Parallel() to speed up some massive numpy.fft calculations.
I followed this example from the joblib web page and, using it as-is, I see the following results on my computer:
Elapsed time computing the average of couple of slices 1.69 s
Elapsed time computing the average of couple of slices 2.64 s
Elapsed time computing the average of couple of slices 0.40 s
Elapsed time computing the average of couple of slices 0.26 s
They look fine! Then I changed data[s1].mean() to np.fft.fft( data[s1] ), see the following code:
import numpy as np
data = np.random.random((int(2**24),))
window_size = int(256)
slices = [slice(start, start + window_size)
          for start in range(0, data.size - window_size, window_size)]
len(slices)

import time
def slow_FFT(data, sl):
    return np.fft.fft(data[sl])

tic = time.time()
results = [slow_FFT(data, sl) for sl in slices]
toc = time.time()
print('\nElapsed time computing the average of couple of slices {:.2f} s'
      .format(toc - tic))
np.shape(results)

from joblib import Parallel, delayed
tic = time.time()
results2 = Parallel(n_jobs=4)(delayed(slow_FFT)(data, sl) for sl in slices)
toc = time.time()
print('\nElapsed time computing the average of couple of slices {:.2f} s'
      .format(toc - tic))

import os
from joblib import dump, load, Parallel
folder = './joblib5_memmap'
try:
    os.mkdir(folder)
except FileExistsError:
    pass

data_filename_memmap = os.path.join(folder, 'data_memmap')
dump(data, data_filename_memmap)
data = load(data_filename_memmap, mmap_mode='r')
tic = time.time()
results3 = Parallel(n_jobs=4)(delayed(slow_FFT)(data, sl) for sl in slices)
toc = time.time()
print('\nElapsed time computing the average of couple of slices {:.2f} s\n'
      .format(toc - tic))

def slow_FFT_write_output(data, sl, output, idx):
    res_ = np.fft.fft(data[sl])
    output[idx, :] = res_

output_filename_memmap = os.path.join(folder, 'output_memmap')
output = np.memmap(output_filename_memmap, dtype=np.cdouble,
                   shape=(len(slices), window_size), mode='w+')
data = load(data_filename_memmap, mmap_mode='r')
tic = time.time()
_ = Parallel(n_jobs=4)(delayed(slow_FFT_write_output)(data, sl, output, idx)
                       for idx, sl in enumerate(slices))
toc = time.time()
print('\nElapsed time computing the average of couple of slices {:.2f} s\n'
      .format(toc - tic))
print(np.allclose(np.array(results), output))
I do not see the 4-core speedup in the "Writable memmap for shared memory" step.
First, we evaluate the sequential computing of our problem:
Elapsed time computing the average of couple of slices 0.62 s
joblib.Parallel() is used to compute in parallel the average of all slices using 4 workers:
Elapsed time computing the average of couple of slices 4.29 s
Parallel processing is already faster than the sequential processing. It is also possible to remove some overhead by dumping the data array to a memmap and passing the memmap to joblib.Parallel():
Elapsed time computing the average of couple of slices 1.94 s
Writable memmap for shared memory:
Elapsed time computing the average of couple of slices 1.46 s
True
Can someone help me with the "why"? Thanks in advance!
Q:
"Can someone help me with the 'why'?"
A:
Sure. Your code has acquired an IMMENSE add-on overhead cost, and it keeps repeating, 65536 x (so many times!), the
SER / xfer / DES add-on costs ( [SPACE]-wise as RAM allocations + [TIME]-wise as CPU + RAM-I/O delays )
of serialising + transferring p2p + deserialising the same 1.1 [GB] block of RAM data into RAM, again and again:
pass;    tic = time.time()
#||||||||||||||||||||||||||||||||||||||||||| # CRITICAL SECTION
results3 = Parallel( n_jobs = 4               # 1.spawn 4 process replicas
                     )( delayed( slow_FFT     #   + keep
                                 )( data,     #     feeding them with
                                    sl )      #     <_1.1_GB_data_> + <_sl_>-Objects
                        for sl                #     for each slice
                        in  slices            #     from slices
                        )                     #     again and again 65k+ times
#||||||||||||||||||||||||||||||||||||||||||| # CRITICAL SECTION +72 [TB] DATA-FLOW RAM-I/O PAIN
pass;    toc = time.time()
This "low-cost" SLOC, hidden behind the iterator syntax-sugar, gets penalised by so much unproductive work that almost no time is left for the only useful work.
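One way to convince yourself of this is a micro-benchmark of the orchestration cost alone (a hypothetical check, not part of the original answer): dispatch the same 65536 tiny tasks to a do-nothing worker, reusing slices from the question, so that whatever time gets measured is pure spawning / queueing / SER-xfer-DES overhead, with zero useful FFT work inside:

import time
from joblib import Parallel, delayed

def do_nothing( sl ):          # a worker that performs no useful work at all
    return None

tic = time.time()
_ = Parallel( n_jobs = 4 )( delayed( do_nothing )( sl ) for sl in slices )
toc = time.time()
print( 'Pure per-task orchestration overhead: {:.2f} s'.format( toc - tic ) )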
A better refactoring strategy is to pay the SER / xfer / DES add-on cost only once (during the instantiation of the n_jobs processes, which happens anyway) and never to pass data, which is already "known" inside all of the replicated n_jobs Python interpreter processes. It is better to formulate ad-hoc iterators so that the "remote" workers operate autonomously over large blocks, via a smart call-signature that gets called only once per worker (and not as many as 65536 x):
def smartFFT( aTupleOfStartStopShiftINDEX = ( 0, -FFT_WINDOW_SIZE, 1 ) ):
    global FFT_WINDOW_SIZE
    global DATA_IN
    #------------------------
    # compute all FFT-results
    # for "known" DATA_IN,
    # for each block from aTupleOfStartStopShiftINDEX[0]
    #               till  aTupleOfStartStopShiftINDEX[1]
    #               shifting by aTupleOfStartStopShiftINDEX[2]
    #               of size FFT_WINDOW_SIZE
    #------------------------ prefer powers of Numpy vectorized code
    #------------------------ best with using smart-striding-tricks
    return block_of_RESULTS_at_once
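As an illustration only (a sketch, not the answer's implementation), a minimal runnable body for such a block-worker could look as follows, assuming FFT_WINDOW_SIZE and DATA_IN are module-level globals that every worker process re-creates at import time, e.g. by re-opening the read-only memmap file from the question:

import numpy as np
from joblib import load

FFT_WINDOW_SIZE = 256                                              # assumed module-level constants,
DATA_IN = load( './joblib5_memmap/data_memmap', mmap_mode = 'r' )  # re-created inside every worker

def smartFFT( aTupleOfStartStopShiftINDEX ):
    # unpack the ( start, stop, shift )-triple describing one large block of windows
    start, stop, shift = aTupleOfStartStopShiftINDEX
    # gather every window of this block into a single 2D array ...
    windows = np.stack( [ DATA_IN[s:s+FFT_WINDOW_SIZE]
                          for s in range( start, stop, shift ) ] )
    # ... and FFT all of them in one vectorised call
    return np.fft.fft( windows, axis = 1 )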
Then comes just:
pass;    tic = time.time()
#||||||||||||||||||||||||||||||||||||||||||| # CRITICAL SECTION
results3 = Parallel( n_jobs = 4               # 1.spawn 4 process replicas
                     )( delayed( smartFFT     #   + keep
                                 )( iTup )    #     feeding them with
                        for iTup              #     just iTup tuple
                        in  iTuples           #
                        )                     #     just n_jobs ~ 4 times
#||||||||||||||||||||||||||||||||||||||||||| # CRITICAL SECTION +0 [kB] DATA-FLOW
pass;    toc = time.time()
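For completeness, iTuples could be as few as n_jobs window-aligned, non-overlapping chunks, one ( start, stop, shift )-triple per worker (a hypothetical construction, a sketch under the same assumptions as above):

n_jobs     = 4
n_windows  = DATA_IN.size // FFT_WINDOW_SIZE                  # ~65536 windows in the question's setup
per_worker = n_windows // n_jobs                               # windows handled by a single call
iTuples    = [ ( w       * per_worker * FFT_WINDOW_SIZE,       # start
                 ( w+1 ) * per_worker * FFT_WINDOW_SIZE,       # stop
                 FFT_WINDOW_SIZE )                              # shift
               for w in range( n_jobs ) ]

Each call then returns one ( per_worker, FFT_WINDOW_SIZE ) block of results, and np.vstack( results3 ) reassembles the full array, so only 4 result blocks travel back instead of 65536 tiny ones.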
As Amdahl's Law explains, you have simply paid so much in add-on overhead costs that there is no chance for the code to speed up at all, even once it "somehow starts to work in parallel" (the atomicity-of-work being the second important update to the classical formula, so that it does not get used against the nature of how work-packages actually flow across real-world devices, be they processors or networks of processors).
Paying more for the speedup than you ever get back from it: that is the "why" part. (Repeating the same defect many times does not change its nature, only how often the cost is paid.)
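For reference, an overhead-aware form of Amdahl's Law (a textbook-style extension, not a formula quoted from the original answer) makes this explicit. With p the parallelisable fraction of the work, N the number of workers and o the sum of all add-on overheads (spawning, SER/xfer/DES, result collection) expressed as a fraction of the original runtime:

S(N) = \frac{1}{(1 - p) + o + \frac{p}{N}}

Whenever o exceeds the saved fraction p - p/N, S(N) drops below 1, i.e. the "parallel" run is slower than the sequential one, which is exactly what the 0.62 s versus 4.29 s measurement shows.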