How to optimize a nested for loop in Python
So I am trying to write a python function to return a metric called the Mielke-Berry R value. The metric is calculated like so:

R = 1 - (n² · MAE) / (Σᵢ Σⱼ |yⱼ - xᵢ|)

The current code I have written works, but because of the sum of sums in the equation, the only thing I could think of to solve it was to use a nested for loop in Python, which is very slow...

Below is my code:
def mb_r(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    y = forecasted_array.tolist()
    x = observed_array.tolist()
    total = 0
    for i in range(len(y)):
        for j in range(len(y)):
            total = total + abs(y[j] - x[i])
    total = np.array([total])
    return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total[0])
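For reference, `mae` isn't defined in the post; assuming it is the mean absolute error (my assumption, not the asker's actual helper), a minimal runnable sketch of the function looks like:

```python
import numpy as np

def mae(forecasted_array, observed_array):
    # Assumed helper: mean absolute error (not shown in the original post)
    return np.mean(np.abs(forecasted_array - observed_array))

def mb_r(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value (original nested-loop approach)."""
    assert len(observed_array) == len(forecasted_array)
    y = forecasted_array.tolist()
    x = observed_array.tolist()
    total = 0
    for i in range(len(y)):
        for j in range(len(y)):
            total += abs(y[j] - x[i])
    return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total)

r = mb_r(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.5, 2.0]))
```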
The reason I converted the input arrays to lists is because I have heard (haven't yet tested) that indexing a numpy array using a Python for loop is very slow.

I feel like there may be some sort of numpy function to solve this much faster; does anyone know of anything?
Here's one vectorized way to leverage broadcasting to get total -

np.abs(forecasted_array[:,None] - observed_array).sum()
To accept both lists and arrays alike, we can use the NumPy builtin for the outer subtraction, like so -

np.abs(np.subtract.outer(forecasted_array, observed_array)).sum()
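As a quick sanity check (toy data of my own, not from the question), both vectorized forms reproduce the nested-loop sum of sums exactly:

```python
import numpy as np

forecasted_array = np.array([1.0, 2.0, 3.0])
observed_array = np.array([1.5, 2.5, 2.0])

# Nested-loop reference for the sum of sums
loop_total = sum(abs(y - x) for x in observed_array for y in forecasted_array)

# Broadcasting via a new axis
bcast_total = np.abs(forecasted_array[:, None] - observed_array).sum()

# Builtin outer subtraction; also accepts plain Python lists
outer_total = np.abs(np.subtract.outer(forecasted_array, observed_array)).sum()
```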
We can also make use of the numexpr module for faster absolute computations and perform the summation-reduction in a single numexpr evaluate call, which would be much more memory efficient, like so -

import numexpr as ne

forecasted_array2D = forecasted_array[:,None]
total = ne.evaluate('sum(abs(forecasted_array2D - observed_array))')
If you are not memory constrained, the first step to optimize nested loops in numpy is to use broadcasting and perform operations in a vectorized way:

import numpy as np

def mb_r(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    total = np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()  # Broadcasting
    return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total)
But while in this case the looping occurs in C instead of Python, it involves the allocation of an array of size (N, N).

As noted above, broadcasting implies a huge memory overhead, so it should be used with care; it is not always the right way. While your first impression may be to use it everywhere: do not. Not so long ago I was also confused by this fact; see my question Numpy ufuncs speed vs for loop speed. Not to be too verbose, I will show this on your example:
import numpy as np

# Broadcast version
def mb_r_bcast(forecasted_array, observed_array):
    return np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()

# Inner loop unrolled version
def mb_r_unroll(forecasted_array, observed_array):
    size = len(observed_array)
    total = 0.
    for i in range(size):  # There is only one loop
        total += np.abs(forecasted_array - observed_array[i]).sum()
    return total
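Before timing, it's worth a quick check (with random data of my own) that the two versions agree:

```python
import numpy as np

# Broadcast version: builds the full (N, N) temporary
def mb_r_bcast(forecasted_array, observed_array):
    return np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()

# Unrolled version: one Python loop, vectorized inner reduction
def mb_r_unroll(forecasted_array, observed_array):
    total = 0.
    for i in range(len(observed_array)):
        total += np.abs(forecasted_array - observed_array[i]).sum()
    return total

rng = np.random.default_rng(0)
forecasted = rng.random(100)
observed = rng.random(100)
```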
Small-size arrays (broadcasting is faster)
forecasted = np.random.rand(100)
observed = np.random.rand(100)
%timeit mb_r_bcast(forecasted, observed)
57.5 µs ± 359 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit mb_r_unroll(forecasted, observed)
1.17 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Medium-size arrays (equal)
forecasted = np.random.rand(1000)
observed = np.random.rand(1000)
%timeit mb_r_bcast(forecasted, observed)
15.6 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit mb_r_unroll(forecasted, observed)
16.4 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Large-size arrays (broadcasting is slower)
forecasted = np.random.rand(10000)
observed = np.random.rand(10000)
%timeit mb_r_bcast(forecasted, observed)
1.51 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mb_r_unroll(forecasted, observed)
377 ms ± 994 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, for small arrays the broadcast version is 20x faster than the unrolled one, for medium-sized arrays they are roughly equal, but for large arrays it is 4x slower, because the memory overhead is paying its own costly price.
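One middle ground between the two (a sketch of my own, not from the answer above) is to broadcast in fixed-size chunks, so the temporary stays at chunk_size × N instead of N × N:

```python
import numpy as np

def mb_r_chunked(forecasted_array, observed_array, chunk_size=256):
    # Process chunk_size rows of the (N, N) difference matrix at a time:
    # peak memory is bounded while the inner work stays vectorized.
    total = 0.0
    for start in range(0, len(observed_array), chunk_size):
        chunk = observed_array[start:start + chunk_size, np.newaxis]
        total += np.abs(forecasted_array - chunk).sum()
    return total

rng = np.random.default_rng(1)
forecasted = rng.random(1000)
observed = rng.random(1000)
chunked_total = mb_r_chunked(forecasted, observed)
```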
Another approach is to use numba and its magically powerful @jit function-decorator. In this case, only a slight modification of your initial code is necessary. Also, to make the loops parallel, you should change range to prange and provide the parallel=True keyword argument. In the snippet below I use the @njit decorator, which is the same as @jit(nopython=True):
from numba import njit, prange

@njit(parallel=True)
def mb_r_njit(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    total = 0.
    size = len(forecasted_array)
    for i in prange(size):
        observed = observed_array[i]
        for j in prange(size):
            total += abs(forecasted_array[j] - observed)
    return 1 - (mae(forecasted_array, observed_array) * size ** 2 / total)
You didn't provide the mae function, but to run the code in njit mode you must also decorate the mae function, or, if it is a number, pass it as an argument to the jitted function.
The Python scientific ecosystem is huge; I'll just mention some other equivalent options to speed things up: Cython, Nuitka, Pythran, bottleneck and many others. Perhaps you are interested in gpu computing, but that is actually another story.
On my computer, unfortunately an old one, the timings are:
import numpy as np
import numexpr as ne
forecasted_array = np.random.rand(10000)
observed_array = np.random.rand(10000)
initial version
%timeit mb_r(forecasted_array, observed_array)
23.4 s ± 430 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numexpr
%%timeit
forecasted_array2d = forecasted_array[:, np.newaxis]
ne.evaluate('sum(abs(forecasted_array2d - observed_array))')[()]
784 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
broadcast version
%timeit mb_r_bcast(forecasted, observed)
1.47 s ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
inner loop unrolled version
%timeit mb_r_unroll(forecasted, observed)
389 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numba njit(parallel=True) version
%timeit mb_r_njit(forecasted_array, observed_array)
32 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It can be seen that the njit approach is 730x faster than your initial solution, and also 24.5x faster than the numexpr solution (maybe you need Intel's Vector Math Library to accelerate it). Also, the simple approach with the inner loop unrolled gives you a 60x speed-up compared to your initial version. My specs are:
Intel(R) Core(TM)2 Quad CPU Q9550 2.83GHz
Python 3.6.3
numpy 1.13.3
numba 0.36.1
numexpr 2.6.4
I was surprised by your phrase "I have heard (haven't yet tested) that indexing a numpy array using a python for loop is very slow." So I tested:
arr = np.arange(1000)
ls = arr.tolist()
%timeit for i in arr: pass
69.5 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit for i in ls: pass
13.3 µs ± 81.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit for i in range(len(arr)): arr[i]
167 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit for i in range(len(ls)): ls[i]
90.8 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and it turns out that you are right: it is 2-5x faster to iterate over the list. Of course, these results must be taken with a certain amount of irony :)
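The real takeaway, of course, is to avoid Python-level iteration altogether; a reduction done inside NumPy beats looping over either container (a small illustration of my own):

```python
import numpy as np

arr = np.arange(1000)

# Python-level loop: each element crosses the C/Python boundary
loop_sum = 0
for v in arr.tolist():
    loop_sum += v

# Vectorized reduction: the loop runs in C inside NumPy
vec_sum = int(arr.sum())
```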
As a reference, the following code:
#pythran export mb_r(float64[], float64[])
import numpy as np

def mb_r(forecasted_array, observed_array):
    return np.abs(forecasted_array[:,None] - observed_array).sum()
Runs at the following speed on pure CPython:
% python -m perf timeit -s 'import numpy as np; x = np.random.rand(400); y = np.random.rand(400); from mbr import mb_r' 'mb_r(x, y)'
.....................
Mean +- std dev: 730 us +- 35 us
And when compiled with Pythran I get:
% pythran -march=native -DUSE_BOOST_SIMD mbr.py
% python -m perf timeit -s 'import numpy as np; x = np.random.rand(400); y = np.random.rand(400); from mbr import mb_r' 'mb_r(x, y)'
.....................
Mean +- std dev: 65.8 us +- 1.7 us
So roughly a 10x speedup, on a single core with the AVX extension.