Numba: Manual looping faster than a += c * b with numpy arrays?

Question

I would like to do a 'daxpy' (add to a vector the scalar multiple of a second vector and assign the result to the first) with numpy using numba . Doing the following test, I noticed that writing the loop myself was much faster than doing a += c * b .

I was not expecting this. What is the reason for this behavior?

import numpy as np
from numba import jit

x = np.random.random(int(1e6))
o = np.random.random(int(1e6))
c = 3.4

@jit(nopython=True)
def test1(a, b, c):
    a += c * b
    return a

@jit(nopython=True)
def test2(a, b, c):
    for i in range(len(a)):
        a[i] += c * b[i]
    return a

%timeit -n100 -r10 test1(x, o, c)
>>> 100 loops, best of 10: 2.48 ms per loop
%timeit -n100 -r10 test2(x, o, c)
>>> 100 loops, best of 10: 1.2 ms per loop

Answer 1

One thing to keep in mind is 'manual looping' in numba is very fast, essentially the same as the c-loop used by numpy operations.

In the first example there are two operations, a temporary array ( c * b ) is allocated / calculated, then that temporary array is added to a . In the second example, both calculations are happening in the same loop with no intermediate result.

In theory, numba could fuse loops and optimize #1 to do the same as #2, but it doesn't seem to be doing it. If you just want to optimize numpy ops, numexpr may also be worth a look as it was designed for exactly that - though probably won't do any better than the explicit fused loop.

In [17]: import numexpr as ne

In [18]: %timeit -r10 test2(x, o, c)
1000 loops, best of 10: 1.36 ms per loop

In [19]: %timeit ne.evaluate('x + o * c', out=x)
1000 loops, best of 3: 1.43 ms per loop

Numba: Manual looping faster than a += c * b with numpy arrays?

Question

1 answers

solution1
2 ACCPTED 2016-09-06 15:13:48

Numba: Manual looping faster than a += c * b with numpy arrays?

Question

1 answers

solution1 2 ACCPTED 2016-09-06 15:13:48

solution1
2 ACCPTED 2016-09-06 15:13:48