Why is a simple in-place addition with numba so much faster than with numpy?

Question

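According to the code snippet below, performing an in-place addition with a numba jit-compiled function is about 10 times faster than using numpy's ufunc.

That would be understandable for a function that performs several numpy operations, as explained in this question.

But here the improvement concerns a single, simple numpy ufunc... So why is numba so much faster? I was (naively?) expecting that a numpy ufunc somehow uses compiled code internally, and that a task as simple as an addition would already be close to optimal.

More generally: should I expect such dramatic performance differences for other numpy functions? Is there a way to predict when it is worth rewriting a function and numba-jitting it?

The code: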

import numpy as np
import timeit
import numba

N = 200
target1 = np.ones( N )
target2 = np.ones( N )

# we're going to add these values :
addedValues = np.random.uniform( size=1000000  )
# into these positions : 
indices = np.random.randint(N,size=1000000) 


@numba.njit
def addat(target, index, tobeadded):
    for i in range( index.size):        
        target[index[i]] += tobeadded[i]

# pre-run to jit compile the function
addat( target2, indices, addedValues)
target2 = np.ones( N ) # reset

npaddat = np.add.at
t1 = timeit.timeit( "npaddat( target1, indices, addedValues)", number=3, globals=globals())
t2 = timeit.timeit( "addat( target2, indices, addedValues)", number=3,globals=globals())
assert( (target1==target2).all() )

print("np.add.at time=",t1, )
print("jit-ed addat time =",t2 )

On my computer, I get:

np.add.at time= 0.21222890191711485
jit-ed addat time = 0.003389443038031459

So that is well over a 10x improvement (roughly 60x in this particular run)...

Answer 1

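np.add.at() (the ufunc.at method of np.add) is much more general than your addat(). It iterates over the array elements and calls some unit operation function for each element. Let that unit operation function be add_vectors(). It adds two input vectors, where "vector" means that the array elements are laid out in C-contiguous order and aligned. It uses SIMD operations where possible.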

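Because np.add.at() accesses elements at arbitrary indexed positions rather than sequentially, add_vectors() has to be called separately for each pair of input elements. Your addat() does not pay this penalty, because Numba generates machine code that accesses the NumPy array elements directly.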
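To make the element-by-element nature of this operation concrete, here is a minimal NumPy-only sketch. The helper name addat_reference is hypothetical and only mirrors the loop from the question without the numba decorator; the array sizes are arbitrary assumptions. It shows that np.add.at produces the same result as the explicit loop, while the buffered form target[indices] += values does not accumulate repeated indices.

```python
import numpy as np

def addat_reference(target, index, tobeadded):
    # Hypothetical pure-Python version of the loop from the question:
    # one indexed read-modify-write per input element.
    for i in range(index.size):
        target[index[i]] += tobeadded[i]

rng = np.random.default_rng(0)
N = 200
indices = rng.integers(0, N, size=10_000)   # many repeated indices
values = rng.uniform(size=10_000)

a = np.ones(N)
b = np.ones(N)
c = np.ones(N)

np.add.at(a, indices, values)         # unbuffered: repeated indices accumulate
addat_reference(b, indices, values)   # same accumulation, written as a loop
c[indices] += values                  # buffered: only one write per index survives

same_as_loop = np.allclose(a, b)
buffered_differs = not np.allclose(a, c)
print(same_as_loop, buffered_differs)
```

The comparison is only about semantics: it says nothing about how NumPy implements the loop internally, which is what the answer above refers to.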

You can see this overhead in the NumPy source code, for example.

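As for your second question about the performance of other NumPy functions, I suggest experimenting yourself, because both NumPy and Numba do a lot of complicated things behind the scenes. (My naive view is that a well-written Numba implementation of a ufunc operation will perform better than the NumPy implementation, because Numba also uses SIMD operations where possible.)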
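One possible way to run such an experiment is sketched below. It is not from the original answer: the function names with_add_at and with_bincount are hypothetical, and np.bincount with weights is swapped in here only as an example of a second NumPy formulation of the same accumulation to time against np.add.at. The idea is to check first that the candidates agree, then time them on your own machine and NumPy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
N = 200
indices = rng.integers(0, N, size=100_000)
values = rng.uniform(size=100_000)

def with_add_at():
    # Baseline from the question: unbuffered indexed accumulation.
    target = np.ones(N)
    np.add.at(target, indices, values)
    return target

def with_bincount():
    # Alternative formulation (an assumption of this sketch):
    # np.bincount sums the weights that fall on each index.
    target = np.ones(N)
    target += np.bincount(indices, weights=values, minlength=N)
    return target

# Always verify that the candidates agree before comparing timings.
results_agree = np.allclose(with_add_at(), with_bincount())

for fn in (with_add_at, with_bincount):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds:.6f} s for 3 runs")
```

The timings depend heavily on the NumPy version, array sizes, and hardware, so the printed numbers should be read as a local measurement rather than a general result. A numba-jitted candidate can be added to the same loop, as long as it is called once beforehand so compilation time is excluded.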

Solution 1
1 Accepted 2022-12-01 15:53:47
