
How to optimize a nested for loop in Python

So I am trying to write a Python function to return a metric called the Mielke-Berry R value. The metric is calculated like so:

R = 1 - (n^2 * MAE) / (sum_{i=1}^{n} sum_{j=1}^{n} |y_j - x_i|)

where n is the length of the arrays, y is the forecasted array, x is the observed array, and MAE is the mean absolute error.

The current code I have written works, but because of the sum of sums in the equation, the only thing I could think of to solve it was to use a nested for loop in Python, which is very slow...

Below is my code:

def mb_r(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    y = forecasted_array.tolist()
    x = observed_array.tolist()
    total = 0
    # O(n^2) double loop over every pair -- this is the bottleneck
    for i in range(len(y)):
        for j in range(len(y)):
            total = total + abs(y[j] - x[i])
    # mae() is a mean-absolute-error helper, not shown in the question
    return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total)

The reason I converted the input arrays to lists is that I have heard (haven't yet tested) that indexing a numpy array using a Python for loop is very slow.

I feel like there may be some sort of numpy function to solve this much faster - does anyone know of anything?

Here's one vectorized way to leverage broadcasting to get total -

np.abs(forecasted_array[:,None] - observed_array).sum()
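
To see what the broadcasting does here: the (N, 1) column minus the (N,) row expands to the full (N, N) table of pairwise differences, whose absolute values are then summed. A tiny sketch with made-up 3-element arrays:

import numpy as np

f = np.array([1.0, 2.0, 3.0])   # stand-in for forecasted_array
o = np.array([0.5, 1.5, 2.5])   # stand-in for observed_array
diff = f[:, None] - o           # (3, 1) - (3,) broadcasts to (3, 3)
print(np.abs(diff).sum())       # the double sum in a single expression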

To accept both lists and arrays alike, we can use the NumPy builtin for the outer subtraction, like so -

np.abs(np.subtract.outer(forecasted_array, observed_array)).sum()
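
This variant also works directly on plain Python lists, no conversion needed - a small made-up example:

import numpy as np

# |1-0.5| + |1-3| + |2-0.5| + |2-3| = 5.0
print(np.abs(np.subtract.outer([1.0, 2.0], [0.5, 3.0])).sum())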

We can also make use of the numexpr module for faster absolute-value computation and perform the summation-reduction in a single numexpr evaluate call, which is much more memory efficient, like so -

import numexpr as ne

forecasted_array2D = forecasted_array[:,None]
total = ne.evaluate('sum(abs(forecasted_array2D - observed_array))')
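
Note that ne.evaluate returns a 0-d array here; indexing with [()] (as done in the timings below) extracts the scalar:

total_scalar = total[()]  # pull the scalar out of the 0-d array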

Broadcasting in numpy

If you are not memory constrained, the first step to optimize nested loops in numpy is to use broadcasting and perform operations in a vectorized way:

import numpy as np

def mb_r(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    total = np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()  # Broadcasting
    return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total)

But while in this case the looping occurs in C instead of Python, it involves the allocation of an array of size (N, N).
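
To put a number on that overhead: for float64 inputs the temporary pairwise-difference array costs 8 * N**2 bytes, which grows quickly. A back-of-the-envelope check:

N = 10000
print(N * N * 8 / 1e9, 'GB')  # 0.8 GB for a single (N, N) float64 temporary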

Broadcasting is not a panacea, try to unroll the inner loop

As noted above, broadcasting implies a huge memory overhead, so it should be used with care - it is not always the right way. While your first impression may be to use it everywhere - do not. Not so long ago I was also confused by this fact; see my question Numpy ufuncs speed vs for loop speed. Without being too verbose, I will show this on your example:

import numpy as np

# Broadcast version
def mb_r_bcast(forecasted_array, observed_array):
    return np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()

# Inner loop unrolled version
def mb_r_unroll(forecasted_array, observed_array):
    size = len(observed_array)
    total = 0.
    for i in range(size):  # There is only one loop
        total += np.abs(forecasted_array - observed_array[i]).sum()
    return total
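
As a quick sanity check with small random inputs, the two variants compute the same double sum up to floating-point rounding:

f = np.random.rand(100)
o = np.random.rand(100)
assert np.isclose(mb_r_bcast(f, o), mb_r_unroll(f, o))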

Small-size arrays (broadcasting is faster)

forecasted = np.random.rand(100)
observed = np.random.rand(100)

%timeit mb_r_bcast(forecasted, observed)
57.5 µs ± 359 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit mb_r_unroll(forecasted, observed)
1.17 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Medium-size arrays (equal)

forecasted = np.random.rand(1000)
observed = np.random.rand(1000)

%timeit mb_r_bcast(forecasted, observed)
15.6 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit mb_r_unroll(forecasted, observed)
16.4 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Large-size arrays (broadcasting is slower)

forecasted = np.random.rand(10000)
observed = np.random.rand(10000)

%timeit mb_r_bcast(forecasted, observed)
1.51 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mb_r_unroll(forecasted, observed)
377 ms ± 994 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you can see, for small arrays the broadcast version is 20x faster than the unrolled one, for medium-sized arrays they are roughly equal, but for large arrays it is 4x slower because the memory overhead is paying its own costly price.

Numba jit and parallelization

Another approach is to use numba and its magically powerful @jit function decorator. In this case, only a slight modification of your initial code is necessary. Also, to make the loops parallel, you should change range to prange and provide the parallel=True keyword argument. In the snippet below I use the @njit decorator, which is the same as @jit(nopython=True):

from numba import njit, prange

@njit(parallel=True)
def mb_r_njit(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    total = 0.
    size = len(forecasted_array)
    for i in prange(size):
        observed = observed_array[i]
        for j in prange(size):
            total += abs(forecasted_array[j] - observed)
    return 1 - (mae(forecasted_array, observed_array) * size ** 2 / total)

You didn't provide the mae function, but to run the code in nopython mode you must also decorate the mae function, or, if it is a number, pass it as an argument to the jitted function.
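
For example, a minimal njit-compatible mae could look like the sketch below - the actual mae implementation was not provided, so this is only an assumption about its behavior:

import numpy as np
from numba import njit

@njit
def mae(forecasted_array, observed_array):
    # Hypothetical stand-in: plain mean absolute error
    return np.mean(np.abs(forecasted_array - observed_array))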

Other options

The Python scientific ecosystem is huge; I will just mention some other equivalent options for speeding things up: Cython, Nuitka, Pythran, bottleneck and many others. Perhaps you are interested in GPU computing, but that is really another story.

Timings

On my computer (an old one, unfortunately), the timings are:

import numpy as np
import numexpr as ne

forecasted_array = np.random.rand(10000)
observed_array   = np.random.rand(10000)

initial version

%timeit mb_r(forecasted_array, observed_array)
23.4 s ± 430 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numexpr

%%timeit
forecasted_array2d = forecasted_array[:, np.newaxis]
ne.evaluate('sum(abs(forecasted_array2d - observed_array))')[()]
784 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

broadcast version

%timeit mb_r_bcast(forecasted_array, observed_array)
1.47 s ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

inner loop unrolled version

%timeit mb_r_unroll(forecasted_array, observed_array)
389 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numba njit(parallel=True) version

%timeit mb_r_njit(forecasted_array, observed_array)
32 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It can be seen that the njit approach is 730x faster than your initial solution, and also 24.5x faster than the numexpr solution (maybe you need Intel's Vector Math Library to accelerate it). The simple approach with inner-loop unrolling also gives you a 60x speedup compared to your initial version. My specs are:

Intel(R) Core(TM)2 Quad CPU Q9550 2.83GHz
Python 3.6.3
numpy 1.13.3
numba 0.36.1
numexpr 2.6.4

Final Note

I was surprised by your phrase "I have heard (haven't yet tested) that indexing a numpy array using a python for loop is very slow." So I tested:

arr = np.arange(1000)
ls = arr.tolist()

%timeit for i in arr: pass
69.5 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit for i in ls: pass
13.3 µs ± 81.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit for i in range(len(arr)): arr[i]
167 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit for i in range(len(ls)): ls[i]
90.8 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

and it turns out that you are right. It is 2-5x faster to iterate over the list. Of course, these results must be taken with a certain amount of irony :)

As a reference, the following code:

#pythran export mb_r(float64[], float64[])
import numpy as np

def mb_r(forecasted_array, observed_array):
    return np.abs(forecasted_array[:,None] - observed_array).sum()

runs at the following speed on pure CPython:

% python -m perf timeit -s 'import numpy as np; x = np.random.rand(400); y = np.random.rand(400); from mbr import mb_r' 'mb_r(x, y)' 
.....................
Mean +- std dev: 730 us +- 35 us

And when compiled with Pythran I get:

% pythran -march=native -DUSE_BOOST_SIMD mbr.py
% python -m perf timeit -s 'import numpy as np; x = np.random.rand(400); y = np.random.rand(400); from mbr import mb_r' 'mb_r(x, y)'
.....................
Mean +- std dev: 65.8 us +- 1.7 us

So roughly a 10x speedup, on a single core with the AVX extension.
