
How to optimize a nested for loop in Python

So I am trying to write a Python function to return a metric called the Mielke-Berry R value. The metric is calculated like so (reconstructed here from the code below, since the original image is gone):

$$R = 1 - \frac{n^2 \, \mathrm{MAE}}{\sum_{i=1}^{n} \sum_{j=1}^{n} \left| y_j - x_i \right|}$$

where $x$ is the observed array, $y$ is the forecasted array, $n$ is their common length, and $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i|$ is the mean absolute error.

The current code I have written works, but because of the sum of sums in the equation, the only thing I could think of to solve it was to use a nested for loop in Python, which is very slow...

Below is my code:

import numpy as np

def mb_r(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    y = forecasted_array.tolist()
    x = observed_array.tolist()
    total = 0
    for i in range(len(y)):
        for j in range(len(y)):  # O(n^2) nested loop over both arrays
            total = total + abs(y[j] - x[i])
    total = np.array([total])
    # mae() is a mean-absolute-error helper defined elsewhere in my code
    return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total[0])

The reason I converted the input arrays to lists is that I have heard (though I haven't tested it yet) that indexing a numpy array with a Python for loop is very slow.

I feel like there may be some sort of numpy function to solve this much faster, anyone know of anything?

Here's one vectorized way to leverage broadcasting to get total -

np.abs(forecasted_array[:,None] - observed_array).sum()

To accept both lists and arrays alike, we can use the NumPy builtin for outer subtraction, like so -

np.abs(np.subtract.outer(forecasted_array, observed_array)).sum()
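
As a quick sanity check (with small made-up inputs), both forms should agree:

import numpy as np

f = np.array([1.0, 2.0, 3.0])
o = np.array([0.5, 1.5, 2.5])
# The broadcasted form and the subtract.outer form compute the same total
assert np.isclose(np.abs(f[:, None] - o).sum(),
                  np.abs(np.subtract.outer(f, o)).sum())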

We can also make use of the numexpr module for faster absolute-value computations, performing the summation-reduction in one single numexpr evaluate call, which is also much more memory efficient, like so -

import numexpr as ne

forecasted_array2D = forecasted_array[:,None]
total = ne.evaluate('sum(abs(forecasted_array2D - observed_array))')
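
Note that the sum-reduction in ne.evaluate returns a 0-d array; if a plain scalar is needed, it can be extracted like so -

total_scalar = total[()]  # 0-d array -> scalar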

Broadcasting in numpy

If you are not memory constrained, the first step in optimizing nested loops in numpy is to use broadcasting and perform the operations in a vectorized way:

import numpy as np    

def mb_r(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    total = np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()  # Broadcasting
    # total is a scalar here, so it is used directly (not total[0])
    return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total)

But while in this case the looping occurs in C instead of Python, it involves the allocation of an intermediate array of shape (N, N).
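
To put a rough number on that allocation: for N = 10000 float64 values, the (N, N) intermediate alone takes about 760 MiB:

N = 10_000
intermediate_bytes = N * N * 8       # float64 takes 8 bytes per element
print(intermediate_bytes / 1024**2)  # ~762.9 MiB for the (N, N) temporary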

Broadcasting is not a panacea, try to unroll the inner loop

As noted above, broadcasting implies a huge memory overhead, so it should be used with care; it is not always the right way. While your first impression may be to use it everywhere - do not. Not so long ago I was also confused by this fact; see my question Numpy ufuncs speed vs for loop speed. Not to be too verbose, I will show this with your example:

import numpy as np

# Broadcast version
def mb_r_bcast(forecasted_array, observed_array):
    return np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()

# Inner loop unrolled version
def mb_r_unroll(forecasted_array, observed_array):
    size = len(observed_array)
    total = 0.
    for i in range(size):  # There is only one loop
        total += np.abs(forecasted_array - observed_array[i]).sum()
    return total

Small-size arrays (broadcasting is faster)

forecasted = np.random.rand(100)
observed = np.random.rand(100)

%timeit mb_r_bcast(forecasted, observed)
57.5 µs ± 359 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit mb_r_unroll(forecasted, observed)
1.17 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Medium-size arrays (equal)

forecasted = np.random.rand(1000)
observed = np.random.rand(1000)

%timeit mb_r_bcast(forecasted, observed)
15.6 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit mb_r_unroll(forecasted, observed)
16.4 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Large-size arrays (broadcasting is slower)

forecasted = np.random.rand(10000)
observed = np.random.rand(10000)

%timeit mb_r_bcast(forecasted, observed)
1.51 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mb_r_unroll(forecasted, observed)
377 ms ± 994 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you can see, for small arrays the broadcast version is 20x faster than the unrolled one, for medium-sized arrays they are roughly equal, but for large arrays it is 4x slower because the memory overhead exacts its costly price.
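
As a side note, a common middle ground between the two extremes is to broadcast over blocks of the outer loop, so the intermediate never exceeds shape (chunk, N). This is only a sketch of that idea, not code from the answers above, and the chunk size of 256 is an arbitrary starting point:

import numpy as np

def mb_r_chunked(forecasted_array, observed_array, chunk=256):
    # Broadcast against `chunk` observed values at a time: the temporary
    # array has shape (chunk, N) instead of (N, N)
    total = 0.0
    for start in range(0, len(observed_array), chunk):
        block = observed_array[start:start + chunk, np.newaxis]
        total += np.abs(forecasted_array - block).sum()
    return total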

Numba jit and parallelization

Another approach is to use numba and its powerful @jit function decorator. In this case, only a slight modification of your initial code is necessary. Also, to make the loops parallel, you should change range to prange and provide the parallel=True keyword argument. In the snippet below I use the @njit decorator, which is the same as @jit(nopython=True):

from numba import njit, prange

@njit(parallel=True)
def mb_r_njit(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    total = 0.
    size = len(forecasted_array)
    for i in prange(size):
        observed = observed_array[i]
        for j in prange(size):
            total += abs(forecasted_array[j] - observed)
    return 1 - (mae(forecasted_array, observed_array) * size ** 2 / total)

You didn't provide the mae function, but to run the code in nopython mode you must also decorate the mae function with @njit, or, if it is just a number, pass it as an argument to the jitted function.
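
For illustration, a minimal njit-compatible mae could look like the sketch below; your actual mae implementation was not shown, so this is only an assumption:

import numpy as np
from numba import njit

@njit
def mae(forecasted_array, observed_array):
    # Hypothetical mean absolute error; substitute your real mae here
    return np.abs(forecasted_array - observed_array).mean()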

Other options

The Python scientific ecosystem is huge; I will just mention some other equivalent options for speeding things up: Cython, Nuitka, Pythran, bottleneck and many others. Perhaps you are interested in GPU computing, but that is really another story.

Timings

On my computer (unfortunately an old one), the timings are:

import numpy as np
import numexpr as ne

forecasted_array = np.random.rand(10000)
observed_array   = np.random.rand(10000)

initial version

%timeit mb_r(forecasted_array, observed_array)
23.4 s ± 430 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numexpr

%%timeit
forecasted_array2d = forecasted_array[:, np.newaxis]
ne.evaluate('sum(abs(forecasted_array2d - observed_array))')[()]
784 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

broadcast version

%timeit mb_r_bcast(forecasted_array, observed_array)
1.47 s ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

inner loop unrolled version

%timeit mb_r_unroll(forecasted_array, observed_array)
389 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numba njit(parallel=True) version

%timeit mb_r_njit(forecasted_array, observed_array)
32 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It can be seen that the njit approach is 730x faster than your initial solution, and also 24.5x faster than the numexpr solution (maybe you need Intel's Vector Math Library to accelerate it). Also, the simple approach with inner-loop unrolling gives you a 60x speedup compared to your initial version. My specs are:

Intel(R) Core(TM)2 Quad CPU Q9550 2.83GHz
Python 3.6.3
numpy 1.13.3
numba 0.36.1
numexpr 2.6.4

Final Note

I was surprised by your phrase "I have heard (haven't yet tested) that indexing a numpy array using a python for loop is very slow." So I tested it:

arr = np.arange(1000)
ls = arr.tolist()

%timeit for i in arr: pass
69.5 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit for i in ls: pass
13.3 µs ± 81.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit for i in range(len(arr)): arr[i]
167 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit for i in range(len(ls)): ls[i]
90.8 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

and it turns out that you are right. It is 2-5x faster to iterate over the list. Of course, these results must be taken with a certain amount of irony :)

As a reference, the following code:

#pythran export mb_r(float64[], float64[])
import numpy as np

def mb_r(forecasted_array, observed_array):
    return np.abs(forecasted_array[:,None] - observed_array).sum()

Runs at the following speed on pure CPython:

% python -m perf timeit -s 'import numpy as np; x = np.random.rand(400); y = np.random.rand(400); from mbr import mb_r' 'mb_r(x, y)' 
.....................
Mean +- std dev: 730 us +- 35 us

And when compiled with Pythran I get

% pythran -march=native -DUSE_BOOST_SIMD mbr.py
% python -m perf timeit -s 'import numpy as np; x = np.random.rand(400); y = np.random.rand(400); from mbr import mb_r' 'mb_r(x, y)'
.....................
Mean +- std dev: 65.8 us +- 1.7 us

So roughly a 10x speedup, on a single core with the AVX extension.
