
Python: how to make a Numba-based for loop faster

I have a for loop that costs a lot of time, and I want to use the numba module to speed it up.

My environment is:

win 10
python 3.7.5
anaconda 4.8.3
numpy 0.19.2
numba 0.46.0

The original code is:

import numpy as np


def computePoints(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    points = []
    for row in range(rows):
        p = dxFullCurve[row, :]
        for col in range(columns):
            cprP = p.copy()
            cprP = cprP + direction * (col - columns / 2 - relativeOffset[row]) * cprSpacing
            points.append(cprP)
    return points


if __name__ == '__main__':

    dxFullCurve = np.random.random(size=[500, 3])
    direction = np.array([1, 0, 0])
    rows = 500
    columns = 500
    relativeOffset = np.random.random(size=500)
    cprSpacing = 0.1
    import time
    t1 = time.time()
    for i in range(100):
        computePoints(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing)
    t2 = time.time()
    print('time: ', (t2-t1)/100)


The printed time is about 0.8 s per call.

Then I used numba to speed it up; the code is:

import numba as nb


@nb.jit()
def computePoints(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    points = []
    for row in range(rows):
        p = dxFullCurve[row, :]
        for col in range(columns):
            cprP = p.copy()
            cprP = cprP + direction * (col - columns / 2 - relativeOffset[row]) * cprSpacing
            points.append(cprP)
    return points

Now the time is 0.177 s, so Numba really does speed it up. However, that is only about a 4x speedup. Is there any method to make it faster?

Then I tried Numba's parallel mode, as follows:

@nb.jit(nopython=True, parallel=True)
def computePoints(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    points = []
    for row in range(rows):
        p = dxFullCurve[row, :]
        for col in range(columns):
            cprP = p.copy()
            cprP = cprP + direction * (col - columns / 2 - relativeOffset[row]) * cprSpacing
            points.append(cprP)
    return points

However, it now takes 0.903 s. Unbelievably, it costs even more time than the non-numba code.

I just want to know: is there any way to make my for loop faster?

How to optimize a Numba function

This is a longer comment on @jmd_dk's answer. A few important points are missing there which further speed up the calculation.

  • Explicit inner loop: much easier for the compiler to optimize. In this case the inner loop is very likely unrolled.
  • Removing the unnecessary dependency between loop iterations (the index variable): this makes parallelization possible and is generally a good idea, because a CPU core has more than one execution unit.
  • parallel=True enables parallelization. This is only beneficial if the runtime is large enough; don't do this if a function only takes a few µs.
  • fastmath=True allows algebraic rearrangements. Numerically this can influence the result, and the programmer has to decide whether that is acceptable.
  • error_model='numpy' turns off the check for division by zero. It only really matters on an actual division, where e.g. a division by 2 can then be optimized to a multiplication by 0.5.
  • cache=True: if the function is called with inputs of the same datatypes, the compiled function only has to be loaded from cache after you restart the interpreter. This is especially useful for more complicated functions.
  • Avoid lists when possible (already mentioned by jmd_dk).
  • Use assert statements: this is not only for safety (there is no bounds checking), it also informs the compiler of the exact memory layout. Without that knowledge the compiler often cannot determine whether SIMD vectorization is beneficial. You don't see much speedup from it here, because this function is mostly copying data with a few insignificant multiplications in between, but the speedup on other functions can be substantial.
  • Further optimization: avoid memory allocation if possible. It is by far the most costly part of this whole function (at least once the previous points are implemented).

Example

import numpy as np
import numba as nb


@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def computePoints_nb_2(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    assert dxFullCurve.shape[1] == 3
    assert direction.shape[0] == 3

    points = np.empty((rows * columns, 3))
    for row in nb.prange(rows):
        for col in range(columns):
            for i in range(3):
                points[row * columns + col, i] = dxFullCurve[row, i] + \
                    direction[i] * (col - columns / 2 - relativeOffset[row]) * cprSpacing
    return points
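To check that the explicit-loop rewrite matches the original list-based function from the question, one can compare both on small inputs. This is a sketch of my own (the names `computePoints_ref` and `computePoints_flat` are mine); `computePoints_flat` is the loop body above without the `@nb.njit` decorator, so it runs even where Numba is not installed:

```python
import numpy as np

def computePoints_ref(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    # Pure-Python reference, mirroring the question's original code.
    points = []
    for row in range(rows):
        p = dxFullCurve[row, :]
        for col in range(columns):
            points.append(p + direction * (col - columns / 2 - relativeOffset[row]) * cprSpacing)
    return np.array(points)

def computePoints_flat(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    # Same layout as computePoints_nb_2, minus the @nb.njit decorator.
    points = np.empty((rows * columns, 3))
    for row in range(rows):
        for col in range(columns):
            for i in range(3):
                points[row * columns + col, i] = dxFullCurve[row, i] + \
                    direction[i] * (col - columns / 2 - relativeOffset[row]) * cprSpacing
    return points

rng = np.random.default_rng(0)
dx = rng.random((4, 3))
off = rng.random(4)
d = np.array([1.0, 0.0, 0.0])
assert np.allclose(computePoints_ref(dx, 4, 5, d, off, 0.1),
                   computePoints_flat(dx, 4, 5, d, off, 0.1))
```

Because the two functions agree element-by-element, the only difference in the jitted version is how the work is scheduled, not what is computed.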

If the memory allocation can be avoided:

@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def computePoints_nb_2_pre(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing, points):
    assert dxFullCurve.shape[1] == 3
    assert direction.shape[0] == 3
    assert points.shape[1] == 3

    for row in nb.prange(rows):
        for col in range(columns):
            for i in range(3):
                points[row * columns + col, i] = dxFullCurve[row, i] + \
                    direction[i] * (col - columns / 2 - relativeOffset[row]) * cprSpacing
    return points
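The pre-allocated variant takes the output buffer as an extra argument, so the buffer is created once outside the function and reused across calls. The timings below assume something along these lines (my sketch, using the question's sizes):

```python
import numpy as np

rows, columns = 500, 500
# Allocate the (rows*columns, 3) output buffer once and reuse it on every call,
# so the jitted function never pays for an allocation.
points = np.empty((rows * columns, 3))
```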

Timings

#Implementation of jmd_dk
%timeit computePoints_nb_1(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing)
#23.2 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit computePoints_nb_2(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing)
#1.54 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit computePoints_nb_2_pre(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing,points)
#122 µs ± 4.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

You have at least two things that slow down your code:

  1. The p.copy() is unnecessary. Just delete the line cprP = p.copy() and change it to cprP = p + direction * ....
  2. Numba is only good with arrays, but you accumulate your results in a list. As far as I can see, all your individual points are arrays of shape (3,), and you have rows*columns of them. In the code below I pre-allocate points as an array and then fill it in during the loop.
import numpy as np
import numba as nb


@nb.jit
def computePoints(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    points = np.empty((rows*columns, 3))
    index = 0
    for row in range(rows):
        p = dxFullCurve[row, :]
        for col in range(columns):
            cprP = p + direction * (col - columns / 2 - relativeOffset[row]) * cprSpacing
            points[index, :] = cprP
            index += 1
    return points

These two changes result in an additional 8x speedup on my machine.
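For comparison only (this is my addition, not part of either answer): once the result is a pre-allocated array, the whole double loop can also be written as a single NumPy broadcasting expression, with no Numba at all. The function name `computePoints_np` is hypothetical:

```python
import numpy as np

def computePoints_np(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    # offsets[row, col] = col - columns/2 - relativeOffset[row], shape (rows, columns)
    offsets = np.arange(columns) - columns / 2 - relativeOffset[:, None]
    # Broadcast (rows, 1, 3) + (rows, columns, 1) * (3,) -> (rows, columns, 3)
    pts = dxFullCurve[:, None, :] + offsets[:, :, None] * direction * cprSpacing
    # Row-major reshape keeps the same point order as the loops: row*columns + col.
    return pts.reshape(rows * columns, 3)
```

Broadcasting pushes the loop into NumPy's C internals, which is often competitive with a simple jitted loop, though the explicit Numba versions above still win when the allocation is hoisted out.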
