
Fastest way to iterate through multiple 2d numpy arrays with numba

When using numba and accessing elements in multiple 2d numpy arrays, is it better to use the index or to iterate the arrays directly? I'm finding that a combination of the two is the fastest, which seems counterintuitive to me. Or is there another, better way to do it?

For context, I am trying to speed up the implementation of the raytracing approach in this paper: https://iopscience.iop.org/article/10.1088/1361-6560/ac1f38/pdf .

I have a function which takes the intensity before propagation and the displacement maps that result from the propagation. The resulting intensity is then the original intensity displaced by the displacement maps pixel by pixel, with sub-pixel displacements shared proportionately between the respective adjacent pixels. On a side note, can this be implemented directly in numpy or in another library? I've noticed it is similar to opencv's remap function.
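On the numpy question: as far as I can tell, the per-pixel bilinear weighting can be expressed as a vectorized scatter with np.add.at (cv2.remap is a gather, so it is not a drop-in replacement). Here is a minimal sketch; the function and argument names are hypothetical, a row/column convention is assumed, and the displacement is always floored, which differs slightly from the branchy sub-pixel handling in the function below:

```python
import numpy as np

def raytrace_scatter(intensity_0, d_r, d_c):
    # Hypothetical vectorized sketch: scatter each pixel bilinearly to its
    # displaced position (d_r/d_c are displacements along rows/columns).
    n_r, n_c = intensity_0.shape
    r, c = np.indices((n_r, n_c))
    fr, fc = np.floor(d_r), np.floor(d_c)
    r0, c0 = (r + fr).astype(np.intp), (c + fc).astype(np.intp)
    wr, wc = d_r - fr, d_c - fc          # fractional parts in [0, 1)
    out = np.zeros((n_r, n_c))
    for dr, dc, w in ((0, 0, (1 - wr) * (1 - wc)),
                      (1, 0, wr * (1 - wc)),
                      (0, 1, (1 - wr) * wc),
                      (1, 1, wr * wc)):
        rr, cc = r0 + dr, c0 + dc
        ok = (rr >= 0) & (rr < n_r) & (cc >= 0) & (cc < n_c)
        # np.add.at accumulates over repeated indices, unlike fancy assignment
        np.add.at(out, (rr[ok], cc[ok]), (intensity_0 * w)[ok])
    return out
```

With zero displacement this returns the input unchanged, and a 0.5-pixel row displacement splits each pixel's intensity evenly between its row and the next, which is the behaviour the question describes.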

import numpy as np
from numba import njit

@njit
def raytrace_range(intensity_0, d_y, d_x):
    """

    Args:

        intensity_0 (2d numpy array): intensity before propagation
        d_y (2d numpy array): Displacement along y in pixels
        d_x (2d numpy array): Displacement along x in pixels

    Returns:
        intensity_z (2d numpy array): intensity after propagation 

    """
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i in range(n_x):
        for j in range(n_y):
            i_ij = intensity_0[i, j]
            dx_ij = d_x[i, j]
            dy_ij = d_y[i, j]

            # Always the same from here down
            if not dx_ij and not dy_ij:
                intensity_z[i, j] += i_ij
                continue
            i_new = i
            j_new = j
            # Calculating displacement bigger than a pixel
            if np.abs(dx_ij) > 1:
                x = np.floor(dx_ij)
                i_new = int(i + x)
                dx_ij = dx_ij - x
            if np.abs(dy_ij) > 1:
                y = np.floor(dy_ij)
                j_new = int(j + y)
                dy_ij = dy_ij - y
            # Calculating sub-pixel displacement
            if 0 <= i_new and i_new < n_y and 0 <= j_new and j_new < n_x:
                intensity_z[i_new, j_new] += i_ij * (1 - np.abs(dx_ij)) * (1 - np.abs(dy_ij))
                if i_new < n_y - 1 and dx_ij >= 0:
                    if j_new < n_y - 1 and dy_ij >= 0:
                        intensity_z[i_new + 1, j_new] += i_ij * dx_ij * (1 - dy_ij)
                        intensity_z[i_new + 1, j_new + 1] += i_ij * dx_ij * dy_ij
                        intensity_z[i_new, j_new + 1] += i_ij * (1 - dx_ij) * dy_ij
                    if j_new and dy_ij < 0:
                        intensity_z[i_new + 1, j_new] += i_ij * dx_ij * (1 - np.abs(dy_ij))
                        intensity_z[i_new + 1, j_new - 1] += i_ij * dx_ij * np.abs(dy_ij)
                        intensity_z[i_new, j_new - 1] += i_ij * (1 - dx_ij) * np.abs(dy_ij)
                if i_new and dx_ij < 0:
                    if j_new < n_x - 1 and dy_ij >= 0:
                        intensity_z[i_new - 1, j_new] += i_ij * np.abs(dx_ij) * (1 - dy_ij)
                        intensity_z[i_new - 1, j_new + 1] += i_ij * np.abs(dx_ij) * dy_ij
                        intensity_z[i_new, j_new + 1] += i_ij * (1 - np.abs(dx_ij)) * dy_ij
                    if j_new and dy_ij < 0:
                        intensity_z[i_new - 1, j_new] += i_ij * np.abs(dx_ij) * (1 - np.abs(dy_ij))
                        intensity_z[i_new - 1, j_new - 1] += i_ij * dx_ij * dy_ij
                        intensity_z[i_new, j_new - 1] += i_ij * (1 - np.abs(dx_ij)) * np.abs(dy_ij)
    return intensity_z

I've tried a few other approaches, of which this is the fastest (each includes the code from above after the comment # Always the same from here down, which I've omitted to keep the question relatively short):

@njit
def raytrace_enumerate(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, i_i in enumerate(intensity_0):
        for j, i_ij in enumerate(i_i):
            dx_ij=d_x[i,j]
            dy_ij=d_y[i,j]
@njit
def raytrace_npndenumerate(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for (i, j), i_ij  in np.ndenumerate(intensity_0):
            dx_ij=d_x[i,j]
            dy_ij=d_y[i,j]
@njit
def raytrace_zip(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, (i_i, dy_i, dx_i) in enumerate(zip(intensity_0, d_y, d_x)):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(zip(i_i, dy_i, dx_i)):
@njit
def raytrace_stack1(idydx):
    n_y, _, n_x = idydx.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, (i_i, dy_i, dx_i) in enumerate(idydx):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(zip(i_i, dy_i, dx_i)):
@njit
def raytrace_stack2(idydx):
    n_y, n_x, _ = idydx.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, k in enumerate(idydx):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(k):

Make up some test data and time:

import timeit
rng = np.random.default_rng()
size = (2010, 2000)
margin = 10
test_data = np.pad(10000*rng.random(size=size), margin)
dx = np.pad(10*(rng.random(size=size)-0.5), margin)
dy = np.pad(10*(rng.random(size=size)-0.5), margin)

# Check results are the same
L = [raytrace_range(test_data, dy, dx),
     raytrace_enumerate(test_data, dy, dx),
     raytrace_npndenumerate(test_data, dy, dx),
     raytrace_zip(test_data, dy, dx),
     raytrace_stack1(np.stack([test_data, dy, dx], axis=1)),
     raytrace_stack2(np.stack([test_data, dy, dx], axis=2))]
print((np.diff(np.vstack(L).reshape(len(L),-1),axis=0)==0).all())

%timeit raytrace_range(test_data, dy, dx)
%timeit raytrace_enumerate(test_data, dy, dx)
%timeit raytrace_npndenumerate(test_data, dy, dx)
%timeit raytrace_zip(test_data, dy, dx)
%timeit raytrace_stack1(np.stack([test_data, dy, dx], axis=1)) #Note this would be the fastest if the arrays were pre-stacked
%timeit raytrace_stack2(np.stack([test_data, dy, dx], axis=2))

Output:

True
40.4 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
37.5 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.8 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
38.6 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
42 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) #Note this would be the fastest if the arrays were pre-stacked
47.4 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Edit 3: Turns out that removing if statements makes range faster than enumerate. See edit 2 below.

Interestingly, on my machine the times get awful with the stack1 and stack2 options, and indeed enumerate seems to be fastest. Maybe thanks to enumerate, numba understands it is a looping variable?:

In [1]: %timeit raytrace_range(test_data, dy, dx)
   ...: %timeit raytrace_enumerate(test_data, dy, dx)
   ...: %timeit raytrace_npndenumerate(test_data, dy, dx)
   ...: %timeit raytrace_zip(test_data, dy, dx)
   ...: %timeit raytrace_stack1(np.stack([test_data, dy, dx], axis=1)) #Note this would be the fastest if the arrays were pre-stacked
   ...: %timeit raytrace_stack2(np.stack([test_data, dy, dx], axis=2))
61 ms ± 785 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
53.9 ms ± 998 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.9 ms ± 471 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
57.5 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
109 ms ± 478 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
146 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Edit: Using fastmath=True did not shave off much time, only ~3 ms.

Edit 2: Although it is not related to the OP's question, after playing a bit with the functions, it turns out that removing "superfluous"(*) conditional statements makes it notably faster, around 20% on my machine. It turns out the implementation works without them (at least the supplied test returns True):

(*) The operations seem to work regardless, as they are "caught" by the later operations. At least, the provided test vector did not report any issues.

#! Using this it is faster:
# Always the same from here down
# if dx_ij==0 and dy_ij==0:
#     intensity_z[i,j]+=i_ij
#     continue
#Calculating displacement bigger than a pixel
x = np.floor(dx_ij)
i_new=int(i+x)
dx_ij=dx_ij-x
y = np.floor(dy_ij)
j_new=int(j+y)
dy_ij=dy_ij-y
# Calculating sub-pixel displacement


In [2]: %timeit raytrace_range(test_data, dy, dx)
   ...: %timeit raytrace_range2(test_data, dy, dx)
   ...: %timeit raytrace_enumerate(test_data, dy, dx)
64.8 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.9 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
56.1 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In general, the fastest way to iterate over an array is a basic low-level integer iterator. Such a pattern causes the minimum number of transformations in Numba, so the compiler should be able to optimize the code pretty well. Functions like zip and enumerate often add additional overhead due to indirect code transformations that are not perfectly optimized out.

Here is a basic example:

import numpy as np
import numba as nb

@nb.njit('(int_[::1],)')
def test(arr):
    s1 = s2 = 0
    for i in range(arr.shape[0]):
        s1 += i
        s2 += arr[i]
    return (s1, s2)

arr = np.arange(200_000)
test(arr)

However, things are more complex when you read/write to multiple arrays simultaneously (which is your case). Indeed, NumPy arrays can be indexed with negative indices, so Numba needs to perform bound checking every time. This check is expensive compared to the actual access, and it can even break some other optimizations (e.g. vectorization).

Consequently, Numba has been optimized so as to analyse the code, detect cases where bound checking is not needed, and avoid adding expensive checks at runtime. This is the case in the above code, but not in your raytrace_range function. enumerate and enumerate + zip can help a lot to remove bound checking because Numba can easily prove that the index lies within the bounds of the array (theoretically it could prove this for raytrace_range too, but the current implementation is unfortunately not smart enough). You can mostly solve this problem using assertions. That is not only good for optimization but also makes your code more robust!

Moreover, the indexing of multidimensional arrays is sometimes not perfectly optimized by the underlying JIT (LLVM-Lite). There is no reason for it not to be optimized, but the compiler uses heuristics to optimize the code that are far from perfect (though pretty good on average). You can help by computing views of lines. This generally results in only a tiny improvement though.

Here is the improved code:

@njit
def raytrace_range_opt(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    assert intensity_0.shape == d_y.shape
    assert intensity_0.shape == d_x.shape

    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)

    for i in range(n_x):
        row_intensity_0 = intensity_0[i, :]
        row_d_x = d_x[i, :]
        row_d_y = d_y[i, :]

        for j in range(n_y):
            assert j >= 0  # Crazy optimization (see later)

            i_ij = row_intensity_0[j]
            dx_ij = row_d_x[j]
            dy_ij = row_d_y[j]

            # Always the same from here down
            if not dx_ij and not dy_ij:
                intensity_z[i, j] += i_ij
                continue

            # Remaining code left unmodified

Notes

Note that I think the indexing of the function raytrace_range is bogus: it should be for i in range(n_y): for j in range(n_x): instead, since the accesses are done with intensity_0[i, j] and you wrote n_y, n_x = intensity_0.shape. Note that swapping the axes also gives correct results based on your validation function (which is suspicious).

The assert j >= 0 instruction alone results in an 8% speed-up, which is crazy, since the integer iterator j is guaranteed to be non-negative whenever n_x is positive, which is always the case because it comes from a shape! This is clearly a missed optimization of Numba that LLVM-Lite cannot perform (since LLVM-Lite does not know what a shape is, nor that shapes are always non-negative). This apparently missing assumption in the Numba code causes additional bound checking (on each of the three arrays) that is pretty expensive.


Benchmark

Here are the results on my machine:

raytrace_range:           47.8 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_enumerate:       38.9 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_npndenumerate:   54.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_zip:             41 ms ± 657 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_stack1:          86.7 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_stack2:          84 ms ± 432 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

raytrace_range_opt:       38.6 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As you can see, raytrace_range_opt is the fastest implementation on my machine.
