Why my code runs so much slower with joblib.Parallel() than without?
I have just started to use joblib.Parallel() to speed up some massive numpy.fft calculations.
I followed this example from the joblib web page and, using it as-is, I see the following results on my computer:
Elapsed time computing the average of couple of slices 1.69 s
Elapsed time computing the average of couple of slices 2.64 s
Elapsed time computing the average of couple of slices 0.40 s
Elapsed time computing the average of couple of slices 0.26 s
They look fine! Then I changed data[s1].mean() to np.fft.fft( data[s1] ), see the following code:
import numpy as np
data = np.random.random((int(2**24),))
window_size = int(256)
slices = [slice(start, start + window_size)
          for start in range(0, data.size - window_size, window_size)]
len(slices)

import time
def slow_FFT(data, sl):
    return np.fft.fft(data[sl])

tic = time.time()
results = [slow_FFT(data, sl) for sl in slices]
toc = time.time()
print('\nElapsed time computing the average of couple of slices {:.2f} s'
      .format(toc - tic))
np.shape(results)

from joblib import Parallel, delayed
tic = time.time()
results2 = Parallel(n_jobs=4)(delayed(slow_FFT)(data, sl) for sl in slices)
toc = time.time()
print('\nElapsed time computing the average of couple of slices {:.2f} s'
      .format(toc - tic))

import os
from joblib import dump, load, Parallel
folder = './joblib5_memmap'
try:
    os.mkdir(folder)
except FileExistsError:
    pass

data_filename_memmap = os.path.join(folder, 'data_memmap')
dump(data, data_filename_memmap)
data = load(data_filename_memmap, mmap_mode='r')
tic = time.time()
results3 = Parallel(n_jobs=4)(delayed(slow_FFT)(data, sl) for sl in slices)
toc = time.time()
print('\nElapsed time computing the average of couple of slices {:.2f} s\n'
      .format(toc - tic))

def slow_FFT_write_output(data, sl, output, idx):
    res_ = np.fft.fft(data[sl])
    output[idx, :] = res_

output_filename_memmap = os.path.join(folder, 'output_memmap')
output = np.memmap(output_filename_memmap, dtype=np.cdouble,
                   shape=(len(slices), window_size), mode='w+')
data = load(data_filename_memmap, mmap_mode='r')
tic = time.time()
_ = Parallel(n_jobs=4)(delayed(slow_FFT_write_output)(data, sl, output, idx)
                       for idx, sl in enumerate(slices))
toc = time.time()
print('\nElapsed time computing the average of couple of slices {:.2f} s\n'
      .format(toc - tic))
print(np.allclose(np.array(results), output))
I do not see the 4-core speedup in the "Writable memmap for shared memory" step.
First, we evaluate the sequential computing of our problem:
Elapsed time computing the average of couple of slices 0.62 s
joblib.Parallel() is used to compute in parallel the average of all slices using 4 workers:
Elapsed time computing the average of couple of slices 4.29 s
Parallel processing is already faster than the sequential processing. It is also possible to remove some overhead by dumping the data array to a memmap and passing the memmap to joblib.Parallel():
Elapsed time computing the average of couple of slices 1.94 s
Writable memmap for shared memory:
Elapsed time computing the average of couple of slices 1.46 s
True
Can someone help me with the "why"? Thanks in advance!
Q:
"Can someone help me with the 'why'?"
A:
Sure. Your code has acquired an IMMENSE add-on overhead cost, and it keeps repeating, 65536 x (so many times!), the
SER / xfer / DES add-on costs ( [SPACE]-wise as RAM allocations + [TIME]-wise as CPU + RAM-I/O delays )
of serialising + transferring p2p + deserialising the same 1.1 [GB] block of RAM data into RAM, again and again:
pass;    tic = time.time()
#||||||||||||||||||||||||||||||||||||||||||| # CRITICAL SECTION
results3 = Parallel( n_jobs = 4               # 1.spawn 4 process replicas
                     )( delayed( slow_FFT     #   + keep
                                 )( data,     #     feeding them with
                                    sl )      #     <_1.1_GB_data_> + <_sl_>-Objects
                        for sl                #     for each slice
                        in  slices            #     from slices
                        )                     #     again and again 65k+ times
#||||||||||||||||||||||||||||||||||||||||||| # CRITICAL SECTION +72 [TB] DATA-FLOW RAM-I/O PAIN
pass;    toc = time.time()
This "low-cost" SLOC, hidden behind the iterator syntax-sugar, gets penalised by so much unproductive work that almost no time is left for the only useful work.
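One way to convince yourself of this is a micro-benchmark of the orchestration cost alone (a hypothetical check, not part of the original answer): dispatch the same 65536 tiny tasks to a do-nothing worker, reusing slices from the question, so that whatever time gets measured is pure spawning / queueing / SER-xfer-DES overhead, with zero useful FFT work inside:

import time
from joblib import Parallel, delayed

def do_nothing( sl ):          # a worker that performs no useful work at all
    return None

tic = time.time()
_ = Parallel( n_jobs = 4 )( delayed( do_nothing )( sl ) for sl in slices )
toc = time.time()
print( 'Pure per-task orchestration overhead: {:.2f} s'.format( toc - tic ) )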
A better refactoring strategy is to pay the SER / xfer / DES add-on cost only once (during the instantiation of the n_jobs processes, which happens anyway) and never to pass data, which is already "known" inside all of the replicated n_jobs Python interpreter processes. It is better to formulate ad-hoc iterators so that the "remote" workers operate autonomously over large blocks, via a smart call-signature that gets called only once per worker (and not as many as 65536 x):
def smartFFT( aTupleOfStartStopShiftINDEX = ( 0, -FFT_WINDOW_SIZE, 1 ) ):
    global FFT_WINDOW_SIZE
    global DATA_IN
    #------------------------
    # compute all FFT-results
    # for "known" DATA_IN,
    # for each block from aTupleOfStartStopShiftINDEX[0]
    #               till  aTupleOfStartStopShiftINDEX[1]
    #               shifting by aTupleOfStartStopShiftINDEX[2]
    #               of size FFT_WINDOW_SIZE
    #------------------------ prefer powers of Numpy vectorized code
    #------------------------ best with using smart-striding-tricks
    return block_of_RESULTS_at_once
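As an illustration only (a sketch, not the answer's implementation), a minimal runnable body for such a block-worker could look as follows, assuming FFT_WINDOW_SIZE and DATA_IN are module-level globals that every worker process re-creates at import time, e.g. by re-opening the read-only memmap file from the question:

import numpy as np
from joblib import load

FFT_WINDOW_SIZE = 256                                              # assumed module-level constants,
DATA_IN = load( './joblib5_memmap/data_memmap', mmap_mode = 'r' )  # re-created inside every worker

def smartFFT( aTupleOfStartStopShiftINDEX ):
    # unpack the ( start, stop, shift )-triple describing one large block of windows
    start, stop, shift = aTupleOfStartStopShiftINDEX
    # gather every window of this block into a single 2D array ...
    windows = np.stack( [ DATA_IN[s:s+FFT_WINDOW_SIZE]
                          for s in range( start, stop, shift ) ] )
    # ... and FFT all of them in one vectorised call
    return np.fft.fft( windows, axis = 1 )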
Then comes just:
pass;    tic = time.time()
#||||||||||||||||||||||||||||||||||||||||||| # CRITICAL SECTION
results3 = Parallel( n_jobs = 4               # 1.spawn 4 process replicas
                     )( delayed( smartFFT     #   + keep
                                 )( iTup )    #     feeding them with
                        for iTup              #     just iTup tuple
                        in  iTuples           #
                        )                     #     just n_jobs ~ 4 times
#||||||||||||||||||||||||||||||||||||||||||| # CRITICAL SECTION +0 [kB] DATA-FLOW
pass;    toc = time.time()
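For completeness, iTuples could be as few as n_jobs window-aligned, non-overlapping chunks, one ( start, stop, shift )-triple per worker (a hypothetical construction, a sketch under the same assumptions as above):

n_jobs     = 4
n_windows  = DATA_IN.size // FFT_WINDOW_SIZE                  # ~65536 windows in the question's setup
per_worker = n_windows // n_jobs                               # windows handled by a single call
iTuples    = [ ( w       * per_worker * FFT_WINDOW_SIZE,       # start
                 ( w+1 ) * per_worker * FFT_WINDOW_SIZE,       # stop
                 FFT_WINDOW_SIZE )                              # shift
               for w in range( n_jobs ) ]

Each call then returns one ( per_worker, FFT_WINDOW_SIZE ) block of results, and np.vstack( results3 ) reassembles the full array, so only 4 result blocks travel back instead of 65536 tiny ones.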
As Amdahl's Law explains, you have simply paid so much in add-on overhead costs that there is no chance for the code to speed up at all, even once it "somehow starts to work in parallel" (the atomicity-of-work being the second important update to the classical formula, so that it does not get used against the nature of how work-packages actually flow across real-world devices, be they processors or networks of processors).
Paying more for the speedup than you ever get back from it: that is the "why" part. (Repeating the same defect many times does not change its nature, only how often the cost is paid.)
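For reference, an overhead-aware form of Amdahl's Law (a textbook-style extension, not a formula quoted from the original answer) makes this explicit. With p the parallelisable fraction of the work, N the number of workers and o the sum of all add-on overheads (spawning, SER/xfer/DES, result collection) expressed as a fraction of the original runtime:

S(N) = \frac{1}{(1 - p) + o + \frac{p}{N}}

Whenever o exceeds the saved fraction p - p/N, S(N) drops below 1, i.e. the "parallel" run is slower than the sequential one, which is exactly what the 0.62 s versus 4.29 s measurement shows.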