I have a for loop that takes a lot of time, and I want to use the numba module to speed it up.
My environment is:
win 10
python 3.7.5
anaconda 4.8.3
numpy 1.19.2
numba 0.46.0
The original code is:
import numpy as np

def computePoints(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    points = []
    for row in range(rows):
        p = dxFullCurve[row, :]
        for col in range(columns):
            cprP = p.copy()
            cprP = cprP + direction * (col - columns / 2 - relativeOffset[row]) * cprSpacing
            points.append(cprP)
    return points
if __name__ == '__main__':
    dxFullCurve = np.random.random(size=[500, 3])
    direction = np.array([1, 0, 0])
    rows = 500
    columns = 500
    relativeOffset = np.random.random(size=500)
    cprSpacing = 0.1

    import time
    t1 = time.time()
    for i in range(100):
        computePoints(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing)
    t2 = time.time()
    print('time: ', (t2 - t1) / 100)
The printed time is: 0.8 s.
Then, I use numba to speed it up, and the code is:
import numba as nb

@nb.jit()
def computePoints(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    points = []
    for row in range(rows):
        p = dxFullCurve[row, :]
        for col in range(columns):
            cprP = p.copy()
            cprP = cprP + direction * (col - columns / 2 - relativeOffset[row]) * cprSpacing
            points.append(cprP)
    return points
Now the time is 0.177 s. Numba really does speed it up, but only by about 4x. Is there any method to make it faster?
Then, I tried the numba parallel as following:
@nb.jit(nopython=True, parallel=True)
def computePoints(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    points = []
    for row in range(rows):
        p = dxFullCurve[row, :]
        for col in range(columns):
            cprP = p.copy()
            cprP = cprP + direction * (col - columns / 2 - relativeOffset[row]) * cprSpacing
            points.append(cprP)
    return points
However, the time is now 0.903 s. Unbelievably, it is even slower than the non-numba code.
I just want to know: is there any method to make my for loop faster?
This is a longer comment on @jmd_dk's answer. There are a few important points missing which speed up the calculation further.

parallel=True
    Enables parallelization. This is only beneficial if the runtime is large enough; don't do this if a function only takes a few µs.
fastmath=True
    Algebraic rewrites are allowed. Numerically this can influence the result, and the programmer has to decide whether that is acceptable.
error_model='numpy'
    Turns off the check for division by zero, which is only really needed for a true division; with the check off, a division such as /2 can be optimized to a multiplication by 0.5.
cache=True
    If the function is called with inputs of the same datatypes, the compiled function only has to be loaded from cache after you restart the interpreter, instead of being recompiled. This is especially useful for more complicated functions.

Example
import numpy as np
import numba as nb

@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def computePoints_nb_2(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    assert dxFullCurve.shape[1] == 3
    assert direction.shape[0] == 3
    points = np.empty((rows * columns, 3))
    for row in nb.prange(rows):
        for col in range(columns):
            for i in range(3):
                points[row * columns + col, i] = dxFullCurve[row, i] + direction[i] * (col - columns / 2 - relativeOffset[row]) * cprSpacing
    return points
If memory allocation can be avoided by pre-allocating the output array and passing it in, it gets even faster:
@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def computePoints_nb_2_pre(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing, points):
    assert dxFullCurve.shape[1] == 3
    assert direction.shape[0] == 3
    assert points.shape[1] == 3
    for row in nb.prange(rows):
        for col in range(columns):
            for i in range(3):
                points[row * columns + col, i] = dxFullCurve[row, i] + direction[i] * (col - columns / 2 - relativeOffset[row]) * cprSpacing
    return points
Timings
#Implementation of jmd_dk
%timeit computePoints_nb_1(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing)
#23.2 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit computePoints_nb_2(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing)
#1.54 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#points pre-allocated once outside the timed call, e.g. points = np.empty((rows*columns, 3))
%timeit computePoints_nb_2_pre(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing,points)
#122 µs ± 4.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
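As an aside (not part of the original answers): the same points array can also be built with pure NumPy broadcasting, with no explicit loop at all. A minimal sketch, where `computePoints_broadcast` is a name of my own choosing and the final reshape reproduces the row-major layout of the loop versions:

```python
import numpy as np

def computePoints_broadcast(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    # (rows, columns) grid of scalar offsets along `direction`
    offsets = (np.arange(columns) - columns / 2 - relativeOffset[:, None]) * cprSpacing
    # broadcast: (rows, 1, 3) + (rows, columns, 1) * (3,) -> (rows, columns, 3)
    pts = dxFullCurve[:, None, :] + offsets[:, :, None] * direction
    # row-major reshape matches the points[row*columns + col] indexing above
    return pts.reshape(rows * columns, 3)
```

This will not beat the parallel numba kernel, but it is a simple baseline that needs no compilation step.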
You have at least two things that slow down your code:

1. p.copy() is unnecessary. Just delete the line cprP = p.copy() and change the next line to cprP = p + direction * ....
2. Appending to a Python list. As far as I can see, all your individual points are arrays of shape (3,), and you have rows*columns of them. In the code below I pre-allocate points as an array and fill it in during the loop.

@nb.jit
def computePoints(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    points = np.empty((rows * columns, 3))
    index = 0
    for row in range(rows):
        p = dxFullCurve[row, :]
        for col in range(columns):
            cprP = p + direction * (col - columns / 2 - relativeOffset[row]) * cprSpacing
            points[index, :] = cprP
            index += 1
    return points
These two changes result in an additional speedup of 8x on my machine.
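A quick sanity check (my addition; the `@nb.jit` decorator is left off so the snippet also runs without numba installed) that the rewrite with no `p.copy()` and a pre-allocated output produces exactly the same values as the original list-based version:

```python
import numpy as np

def computePoints_list(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    # original version: append (3,) arrays to a Python list
    points = []
    for row in range(rows):
        p = dxFullCurve[row, :]
        for col in range(columns):
            cprP = p + direction * (col - columns / 2 - relativeOffset[row]) * cprSpacing
            points.append(cprP)
    return points

def computePoints_prealloc(dxFullCurve, rows, columns, direction, relativeOffset, cprSpacing):
    # rewritten version: pre-allocated array, no copy
    points = np.empty((rows * columns, 3))
    index = 0
    for row in range(rows):
        p = dxFullCurve[row, :]
        for col in range(columns):
            points[index, :] = p + direction * (col - columns / 2 - relativeOffset[row]) * cprSpacing
            index += 1
    return points

rng = np.random.default_rng(1)
rows, columns = 3, 4
dxFullCurve = rng.random((rows, 3))
relativeOffset = rng.random(rows)
direction = np.array([1.0, 0.0, 0.0])
a = computePoints_list(dxFullCurve, rows, columns, direction, relativeOffset, 0.1)
b = computePoints_prealloc(dxFullCurve, rows, columns, direction, relativeOffset, 0.1)
assert np.allclose(np.array(a), b)
```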