
Why is a simple in-place addition much faster with numba than numpy?

According to the snippet below, performing an in-place addition with a numba jit-compiled function is ~10 times faster than with numpy's ufunc.

This would be understandable for a function performing multiple numpy operations, as explained in this question.

But here the improvement concerns a single, simple numpy ufunc... so why is numba so much faster? I'm (naively?) expecting that the numpy ufunc already uses compiled code internally, and that a task as simple as an addition would already be close to optimal?

More generally: should I expect such dramatic performance differences for other numpy functions? Is there a way to predict when it's worth rewriting a function and numba-jitting it?

the code:

import numpy as np
import timeit
import numba

N = 200
target1 = np.ones( N )
target2 = np.ones( N )

# we're going to add these values :
addedValues = np.random.uniform( size=1000000  )
# into these positions : 
indices = np.random.randint(N,size=1000000) 


@numba.njit
def addat(target, index, tobeadded):
    for i in range( index.size):        
        target[index[i]] += tobeadded[i]

# pre-run to jit compile the function
addat( target2, indices, addedValues)
target2 = np.ones( N ) # reset

npaddat = np.add.at
t1 = timeit.timeit( "npaddat( target1, indices, addedValues)", number=3, globals=globals())
t2 = timeit.timeit( "addat( target2, indices, addedValues)", number=3,globals=globals())
assert( (target1==target2).all() )

print("np.add.at time=",t1, )
print("jit-ed addat time =",t2 )

on my computer I get:

np.add.at time= 0.21222890191711485
jit-ed addat time = 0.003389443038031459

so more than a factor of 10 improvement...

The ufunc.add.at() is much more generic than your addat(). It iterates over the array elements and calls a unit-operation function for each element. Let that unit-operation function be add_vectors(): it adds two input vectors, where a vector means array elements in C-contiguous order and aligned, and it utilizes SIMD operations if possible.
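A rough Python-level sketch of that per-element dispatch (the names add_at_sketch and add_vectors here only illustrate the description above, they are not Numpy's actual internal API):

import numpy as np

def add_vectors(out, values):
    # stands in for the compiled inner loop that adds two contiguous,
    # aligned blocks of elements (the part that can use SIMD)
    out += values

def add_at_sketch(target, index, tobeadded):
    # conceptual sketch only: np.add.at is implemented in C, but for
    # scattered indices it ends up invoking the inner loop on length-1
    # slices, paying the setup/dispatch overhead once per element
    for i in range(index.size):
        add_vectors(target[index[i]:index[i] + 1], tobeadded[i:i + 1])

target = np.zeros(5)
add_at_sketch(target, np.array([1, 1, 3]), np.array([0.5, 0.5, 2.0]))
# target is now [0., 1., 0., 2., 0.]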

Because ufunc.add.at() accesses elements randomly (not sequentially), add_vectors() has to be called separately for each pair of input elements. Your addat() does not have this penalty, because Numba generates machine code that accesses the Numpy array elements directly.
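As a side note, for this specific accumulate-by-index pattern there is also a pure-Numpy route that avoids the per-element dispatch: np.bincount with the weights argument. A minimal sketch, assuming the indices are non-negative integers smaller than N:

import numpy as np

N = 200
target = np.ones(N)
addedValues = np.random.uniform(size=1000000)
indices = np.random.randint(N, size=1000000)

# bincount sums the weights that fall into each integer bin in a single
# vectorized pass, so the scattered addition becomes one plain array add
target += np.bincount(indices, weights=addedValues, minlength=N)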

You can see the overhead in the Numpy source at this and this, for example.

For your second question, on the performance of other Numpy functions, I recommend experimenting yourself, because both Numpy and Numba do quite complex things behind the scenes. (My naive opinion is that a well-written Numba implementation of a ufunc operation will perform better than the Numpy implementation, because Numba also utilizes SIMD operations if possible.)
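One way to run such an experiment is a small timing harness like the sketch below (the multiply-add example and the names numpy_fma / numba_fma are purely illustrative). Numpy evaluates a * b + c as two separate ufunc passes with a temporary array, while the Numba loop fuses them into one pass, which is the kind of case where a rewrite tends to pay off.

import numpy as np
import numba
import timeit

def numpy_fma(a, b, c):
    # two ufunc calls: a temporary array is allocated for a * b
    return a * b + c

@numba.njit(fastmath=True)  # fastmath lets LLVM vectorize more aggressively
def numba_fma(a, b, c):
    # a single fused loop, no temporary array
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = a[i] * b[i] + c[i]
    return out

a = np.random.uniform(size=1000000)
b = np.random.uniform(size=1000000)
c = np.random.uniform(size=1000000)

numba_fma(a, b, c)  # pre-run to jit compile the function

t_np = timeit.timeit(lambda: numpy_fma(a, b, c), number=100)
t_nb = timeit.timeit(lambda: numba_fma(a, b, c), number=100)
print("numpy fma time =", t_np)
print("jit-ed fma time =", t_nb)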
