Why is Numba optimizing this regular Python loop, but not the numpy operations?

I wrote this simple test to gauge the performance of Numba and compare it to regular Python and Numpy:

import numba.cuda
import numba
import numpy
import time
import math



SIZE = 1000000
ITER = 10
BLOCK = 256



def func_py(result, op1, op2):
    for pos in range(SIZE):
        result[pos] += op1[pos] * op2[pos]


def func_numpy(result, op1, op2):
    result += op1 * op2


@numba.jit(nopython=True)
def func_numba_py(result, op1, op2):
    for pos in range(SIZE):
        result[pos] += op1[pos] * op2[pos]


@numba.jit(nopython=True)
def func_numba_numpy(result, op1, op2):
    result += op1 * op2


@numba.cuda.jit
def func_cuda(result, op1, op2):
    pos = numba.cuda.grid(1)
    if pos < SIZE:
        result[pos] += op1[pos] * op2[pos]



bnum = int(math.ceil(SIZE / BLOCK))



print("Python")
for i in range(ITER):
    result = numpy.random.rand(SIZE)
    op1 = numpy.random.rand(SIZE)
    op2 = numpy.random.rand(SIZE)
    
    t1 = time.perf_counter()
    func_py(result, op1, op2)
    t2 = time.perf_counter()
    
    elapsed = t2 - t1
    print("Call %i | %.2f ms (%.1f Hz)" % (i + 1, elapsed * 1000, 1 / elapsed))
print()


print("Numpy")
for i in range(ITER):
    result = numpy.random.rand(SIZE)
    op1 = numpy.random.rand(SIZE)
    op2 = numpy.random.rand(SIZE)
    
    t1 = time.perf_counter()
    func_numpy(result, op1, op2)
    t2 = time.perf_counter()
    
    elapsed = t2 - t1
    print("Call %i | %.2f ms (%.1f Hz)" % (i + 1, elapsed * 1000, 1 / elapsed))
print()


print("Numba python")
for i in range(ITER):
    result = numpy.random.rand(SIZE)
    op1 = numpy.random.rand(SIZE)
    op2 = numpy.random.rand(SIZE)
    
    t1 = time.perf_counter()
    func_numba_py(result, op1, op2)
    t2 = time.perf_counter()
    
    elapsed = t2 - t1
    print("Call %i | %.2f ms (%.1f Hz)" % (i + 1, elapsed * 1000, 1 / elapsed))
print()


print("Numba_numpy")
for i in range(ITER):
    result = numpy.random.rand(SIZE)
    op1 = numpy.random.rand(SIZE)
    op2 = numpy.random.rand(SIZE)
    
    t1 = time.perf_counter()
    func_numba_numpy(result, op1, op2)
    t2 = time.perf_counter()
    
    elapsed = t2 - t1
    print("Call %i | %.2f ms (%.1f Hz)" % (i + 1, elapsed * 1000, 1 / elapsed))
print()


print("CUDA")
for i in range(ITER):
    result = numpy.random.rand(SIZE)
    op1 = numpy.random.rand(SIZE)
    op2 = numpy.random.rand(SIZE)
    
    t1 = time.perf_counter()
    func_cuda[bnum, BLOCK](result, op1, op2)
    t2 = time.perf_counter()
    
    elapsed = t2 - t1
    print("Call %i | %.2f ms (%.1f Hz)" % (i + 1, elapsed * 1000, 1 / elapsed))

Here are the results:

Python
Call 1 | 353.78 ms (2.8 Hz)
Call 2 | 353.26 ms (2.8 Hz)
Call 3 | 356.26 ms (2.8 Hz)
Call 4 | 354.09 ms (2.8 Hz)
Call 5 | 356.45 ms (2.8 Hz)
Call 6 | 375.48 ms (2.7 Hz)
Call 7 | 355.36 ms (2.8 Hz)
Call 8 | 355.85 ms (2.8 Hz)
Call 9 | 356.12 ms (2.8 Hz)
Call 10 | 354.66 ms (2.8 Hz)

Numpy
Call 1 | 4.09 ms (244.7 Hz)
Call 2 | 4.36 ms (229.2 Hz)
Call 3 | 4.11 ms (243.1 Hz)
Call 4 | 3.99 ms (250.6 Hz)
Call 5 | 4.06 ms (246.0 Hz)
Call 6 | 4.55 ms (219.8 Hz)
Call 7 | 4.05 ms (246.9 Hz)
Call 8 | 4.31 ms (232.2 Hz)
Call 9 | 4.14 ms (241.4 Hz)
Call 10 | 4.40 ms (227.2 Hz)

Numba python
Call 1 | 107.88 ms (9.3 Hz)
Call 2 | 1.53 ms (654.1 Hz)
Call 3 | 1.47 ms (681.5 Hz)
Call 4 | 1.42 ms (706.2 Hz)
Call 5 | 1.45 ms (692.0 Hz)
Call 6 | 1.51 ms (664.3 Hz)
Call 7 | 1.48 ms (674.2 Hz)
Call 8 | 1.47 ms (682.5 Hz)
Call 9 | 1.40 ms (716.6 Hz)
Call 10 | 1.44 ms (696.4 Hz)

Numba_numpy
Call 1 | 235.23 ms (4.3 Hz)
Call 2 | 3.88 ms (257.7 Hz)
Call 3 | 4.17 ms (239.6 Hz)
Call 4 | 3.93 ms (254.2 Hz)
Call 5 | 3.90 ms (256.3 Hz)
Call 6 | 3.95 ms (253.1 Hz)
Call 7 | 4.16 ms (240.4 Hz)
Call 8 | 4.08 ms (245.1 Hz)
Call 9 | 3.97 ms (252.0 Hz)
Call 10 | 4.09 ms (244.6 Hz)

CUDA
Call 1 | 258.92 ms (3.9 Hz)
Call 2 | 11.67 ms (85.7 Hz)
Call 3 | 11.21 ms (89.2 Hz)
Call 4 | 12.61 ms (79.3 Hz)
Call 5 | 10.93 ms (91.5 Hz)
Call 6 | 11.21 ms (89.2 Hz)
Call 7 | 10.85 ms (92.2 Hz)
Call 8 | 12.30 ms (81.3 Hz)
Call 9 | 10.85 ms (92.2 Hz)
Call 10 | 10.86 ms (92.1 Hz)

I'm surprised to see that the fastest function here is the one that uses a Numba-optimized loop. I was under the impression that Numba was capable of optimizing Numpy code too, and I expected to at least see similar performance between func_numba_py and func_numba_numpy.

Why did Numba fail to optimize the simple Numpy function here?

The Numpy code is not as fast as the Numba code because of temporary arrays. Indeed, op1 * op2 allocates and writes into a temporary array, which is then read back by result += ... to finally write into the output array result. Such accesses are not a problem when the arrays fit in the L1 CPU cache. However, the arrays are big here and may not fit in any CPU cache, resulting in several slow RAM reads/writes.
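To make the hidden temporary concrete, the in-place NumPy statement is roughly equivalent to the following two steps (an illustrative sketch, not code from the original post):

tmp = op1 * op2   # allocates a full SIZE-element temporary and writes it to memory
result += tmp     # reads the temporary back and adds it into result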

Numba can optimize some Numpy functions, but AFAIK it is not able to fuse the computations so they are done in a single pass. Actually, it is not always possible to remove all temporary arrays, and doing it correctly in the general case is pretty complex. For example, result[1:-1] = result[2:] + result[:-2] cannot be done in-place due to aliasing (see the sketch below). Moreover, doing this fusion does not always make the code faster due to complex hardware effects (eg. broken vectorization and cache thrashing).
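A small illustration of the aliasing problem (my own sketch, not from the original answer): a naive left-to-right in-place loop reads elements that were already overwritten, so it cannot reproduce the NumPy expression, which is evaluated through a temporary.

import numpy

a = numpy.arange(6, dtype=numpy.float64)

expected = a.copy()
expected[1:-1] = expected[2:] + expected[:-2]   # NumPy evaluates the RHS into a temporary first

naive = a.copy()
for i in range(1, len(naive) - 1):
    naive[i] = naive[i + 1] + naive[i - 1]      # naive[i - 1] was already overwritten

print(expected)   # expected: 0, 2, 4, 6, 8, 5
print(naive)      # naive:    0, 2, 5, 9, 14, 5 -- differs because of aliasing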

Note that the @vectorize decorator should help with this kind of optimization, since in this case Numba knows the code does not contain any tricky corner cases (eg. aliasing issues): the computation is implicitly done element-wise.
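For example, here is a minimal sketch of the benchmark kernel written with @vectorize (the name fused_madd and the out= call are my own illustration, not part of the original benchmark):

import numba
import numpy

@numba.vectorize(["float64(float64, float64, float64)"])
def fused_madd(res, a, b):
    # One multiply-add per element; Numba compiles this into a single
    # pass over the inputs with no temporary array.
    return res + a * b

SIZE = 1000000
result = numpy.random.rand(SIZE)
op1 = numpy.random.rand(SIZE)
op2 = numpy.random.rand(SIZE)

fused_madd(result, op1, op2, out=result)   # the generated ufunc supports out=, so result is updated in-place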

Finally, the GPU code deals with double-precision floating-point numbers, while most mainstream Nvidia GPUs targeting personal computers are very slow at computing such numbers. Consider using single-precision floating-point numbers (or mixed precision) if you want to get fast code on GPUs.
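For instance, the benchmark arrays could be cast to float32 before launching the kernel (a sketch, assuming single precision is acceptable for your use case):

result = numpy.random.rand(SIZE).astype(numpy.float32)
op1 = numpy.random.rand(SIZE).astype(numpy.float32)
op2 = numpy.random.rand(SIZE).astype(numpy.float32)

func_cuda[bnum, BLOCK](result, op1, op2)   # same kernel, now specialized for float32 arguments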
