Why is numba so fast?

Question

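I want to write a function that takes an array of indices lefts of shape (N_ROWS,) and creates a matrix out of shape (N_ROWS, N_COLS) such that out[i, j] = 1 if and only if j >= lefts[i]. A simple example of doing this in a loop: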

class Looped(Strategy):
    def copy(self, lefts):
        out = np.zeros([N_ROWS, N_COLS])
        for k, l in enumerate(lefts): 
            out[k, l:] = 1
        return out
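
For reference, the rule out[i, j] = 1 iff j >= lefts[i] can also be written without an explicit loop. The following is a minimal sketch using NumPy broadcasting; the names `broadcast_copy` and `looped_copy` and the small sizes are illustrative assumptions, not part of the benchmark in this question.

```python
import numpy as np

N_ROWS, N_COLS = 4, 6  # small illustrative sizes (the question uses 1000 x 1000)


def looped_copy(lefts):
    # Same logic as the Looped strategy: fill each row from lefts[k] onwards.
    out = np.zeros((N_ROWS, N_COLS), dtype=np.float32)
    for k, l in enumerate(lefts):
        out[k, l:] = 1
    return out


def broadcast_copy(lefts):
    # Compare every column index j against lefts[i] for every row i.
    cols = np.arange(N_COLS)
    return (cols[None, :] >= lefts[:, None]).astype(np.float32)


lefts = np.array([0, 2, 5, 3])
expected = looped_copy(lefts)
result = broadcast_copy(lefts)
```

This version builds a temporary boolean matrix first, so it is not necessarily faster than the loop; it is mainly useful as a compact statement of what every implementation has to produce.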

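Now I want it to be as fast as possible, so I have several implementations of this function:

Plain Python loop
Cython implementation
numba with @njit
A pure C++ implementation that I call with ctypes

Here are the results, averaged over 100 runs: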

Looped took 0.0011599776260009093
Cythonised took 0.0006905699110029673
Numba took 8.886413300206186e-05
CPP took 0.00013200821400096175

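So numba is roughly 1.5 times faster than the next fastest implementation, the C++ one. My question is why?

I have heard in similar questions that Cython is slower because it is not compiled with all optimisation flags set, but the C++ implementation was compiled with -O3. Is that enough to get every optimisation the compiler can give me?
I do not fully understand how to hand the numpy array over to C++. Am I inadvertently making a copy of the data here?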

# numba implementation

@njit
def numba_copy(lefts):
    out = np.zeros((N_ROWS, N_COLS), dtype=np.float32)
    for k, l in enumerate(lefts): 
        out[k, l:] = 1.
    return out

    
class Numba(Strategy):
    def __init__(self) -> None:
        # avoid compilation time when timing 
        numba_copy(np.array([1]))

    def copy(self, lefts):
        return numba_copy(lefts)


// array copy cpp

extern "C" void copy(const long *lefts,  float *outdatav, int n_rows, int n_cols) 
{   
    for (int i = 0; i < n_rows; i++) {
        for (int j = lefts[i]; j < n_cols; j++){
            outdatav[i*n_cols + j] = 1.;
        }
    }
}

// compiled to a .so using g++ -O3 -shared -o array_copy.so array_copy.cpp

# using cpp implementation

class CPP(Strategy):

    def __init__(self) -> None:
        lib = ctypes.cdll.LoadLibrary("./array_copy.so")
        fun = lib.copy
        fun.restype = None
        fun.argtypes = [
            ndpointer(ctypes.c_long, flags="C_CONTIGUOUS"),
            ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),
            ctypes.c_int,  # matches `int n_rows` in the C++ signature
            ctypes.c_int,  # matches `int n_cols` in the C++ signature
            ]
        self.fun = fun

    def copy(self, lefts):
        outdata = np.zeros((N_ROWS, N_COLS), dtype=np.float32, )
        self.fun(lefts, outdata, N_ROWS, N_COLS)
        return outdata
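
On the copy question: as far as I know, an `ndpointer` argument type does not convert or copy the array. It validates the array and passes on the address of its existing buffer, raising `TypeError` if the dtype or flags do not match. The sketch below checks this without needing the compiled library, by calling `from_param` directly (the hook ctypes uses to convert arguments). The names `float_ptr` and `is_accepted` are hypothetical helpers for illustration.

```python
import ctypes

import numpy as np
from numpy.ctypeslib import ndpointer

# Same argument type as the `outdatav` parameter above.
float_ptr = ndpointer(ctypes.c_float, flags="C_CONTIGUOUS")


def is_accepted(ptr_type, arr):
    # ndpointer rejects mismatching arrays instead of silently copying them.
    try:
        ptr_type.from_param(arr)
        return True
    except TypeError:
        return False


out = np.zeros((4, 5), dtype=np.float32)

# The converted parameter points at the very same memory as the array.
param = float_ptr.from_param(out)
same_buffer = param.data == out.ctypes.data

wrong_dtype = np.zeros((4, 5), dtype=np.float64)  # float64 instead of float32
non_contiguous = out[:, ::2]                      # strided view, not C-contiguous
```

Under these assumptions the C++ function writes straight into the memory owned by `outdata`, so the ctypes path should add call overhead but no extra copy.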

Full code with timing etc.:

import time
import ctypes
from itertools import combinations

import numpy as np
from numpy.ctypeslib import ndpointer
from numba import njit


N_ROWS = 1000
N_COLS = 1000


class Strategy:

    def copy(self, lefts):
        raise NotImplementedError

    def __call__(self, lefts):
        s = time.perf_counter()
        n = 1000
        for _ in range(n):
            out = self.copy(lefts)
        print(f"{type(self).__name__} took {(time.perf_counter() - s)/n}")
        return out


class Looped(Strategy):
    def copy(self, lefts):
        out = np.zeros([N_ROWS, N_COLS])
        for k, l in enumerate(lefts): 
            out[k, l:] = 1
        return out


@njit
def numba_copy(lefts):
    out = np.zeros((N_ROWS, N_COLS), dtype=np.float32)
    for k, l in enumerate(lefts): 
        out[k, l:] = 1.
    return out


class Numba(Strategy):
    def __init__(self) -> None:
        numba_copy(np.array([1]))

    def copy(self, lefts):
        return numba_copy(lefts)


class CPP(Strategy):

    def __init__(self) -> None:
        lib = ctypes.cdll.LoadLibrary("./array_copy.so")
        fun = lib.copy
        fun.restype = None
        fun.argtypes = [
            ndpointer(ctypes.c_long, flags="C_CONTIGUOUS"),
            ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),
            ctypes.c_int,  # matches `int n_rows` in the C++ signature
            ctypes.c_int,  # matches `int n_cols` in the C++ signature
            ]
        self.fun = fun

    def copy(self, lefts):
        outdata = np.zeros((N_ROWS, N_COLS), dtype=np.float32, )
        self.fun(lefts, outdata, N_ROWS, N_COLS)
        return outdata


def copy_over(lefts):
    strategies = [Looped(), Numba(), CPP()]

    outs = []
    for strategy in strategies:
        o = strategy(lefts)
        outs.append(o)

    for s_0, s_1 in combinations(outs, 2):
        for a, b in zip(s_0, s_1):
            np.testing.assert_allclose(a, b)
    

if __name__ == "__main__":
    copy_over(np.random.randint(0, N_COLS, size=N_ROWS))

Answer 1

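Numba currently uses LLVM-Lite to compile the code efficiently to a binary (after the Python code has been translated to an LLVM intermediate representation). The code is optimized much as C++ code would be when compiled with Clang using the flags -O3 and -march=native. This last flag is very important because it lets LLVM use wider SIMD instructions on relatively recent x86-64 processors: AVX and AVX2 (possibly AVX512 on very recent Intel processors). Otherwise, Clang and GCC use only the SSE/SSE2 instructions by default (for backward compatibility).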

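Another difference shows up when comparing GCC with the LLVM code generated by Numba. Clang/LLVM tends to unroll loops aggressively, while GCC often does not. This has a significant performance impact on the resulting program. In fact, you can see it in the generated assembly code: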

With Clang (128 items per loop iteration):

.LBB0_7:
        vmovups ymmword ptr [r9 + 4*r8 - 480], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 448], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 416], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 384], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 352], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 320], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 288], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 256], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 224], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 192], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 160], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 128], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 96], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 64], ymm0
        vmovups ymmword ptr [r9 + 4*r8 - 32], ymm0
        vmovups ymmword ptr [r9 + 4*r8], ymm0
        sub     r8, -128
        add     rbp, 4
        jne     .LBB0_7

With GCC (8 items per loop iteration):

.L5:
        mov     rdx, rax
        vmovups YMMWORD PTR [rax], ymm0
        add     rax, 32
        cmp     rdx, rcx
        jne     .L5
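
The item counts quoted for the two loops follow directly from the register width and the number of stores per iteration. A small worked calculation, assuming 256-bit ymm registers and 32-bit floats as in the listings above (the variable names are illustrative):

```python
YMM_BITS = 256      # width of one AVX/AVX2 ymm register
FLOAT32_BITS = 32   # the output array holds float32 values

floats_per_store = YMM_BITS // FLOAT32_BITS  # values written by one vmovups

clang_stores_per_iteration = 16  # vmovups instructions in the .LBB0_7 loop body
gcc_stores_per_iteration = 1     # vmovups instructions in the .L5 loop body

clang_items = clang_stores_per_iteration * floats_per_store
gcc_items = gcc_stores_per_iteration * floats_per_store

# Loop iterations needed to fill one full row of 1000 floats (rounded up,
# ignoring how each compiler handles the remainder).
row_length = 1000
clang_iterations = -(-row_length // clang_items)
gcc_iterations = -(-row_length // gcc_items)
```

Fewer iterations means fewer counter updates and branches per byte written, which is where the unrolled loop saves work.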

Thus, to be fair, you need to compare the Numba code with C++ code compiled with Clang and the optimization flags mentioned above.
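
One way to set up such a comparison is to build the same source with both compilers and identical flags. This is only a sketch: the helper names `build_command` and `compile_if_available` are hypothetical, the file names are taken from the question, and `-fPIC` is my addition because shared libraries on Linux usually need it.

```python
import shutil
import subprocess


def build_command(compiler, source="array_copy.cpp", output="array_copy.so"):
    # -O3 and -march=native are the flags discussed above;
    # -fPIC is an assumption, typically required for shared libraries on Linux.
    return [compiler, "-O3", "-march=native", "-shared", "-fPIC",
            "-o", output, source]


def compile_if_available(compiler, **kwargs):
    # Skip quietly when the compiler is not installed on this machine.
    if shutil.which(compiler) is None:
        return False
    subprocess.run(build_command(compiler, **kwargs), check=True)
    return True


clang_cmd = build_command("clang++", output="array_copy_clang.so")
gcc_cmd = build_command("g++", output="array_copy_gcc.so")
```

Calling `compile_if_available("clang++", output="array_copy_clang.so")` would then produce a library that can be loaded with `ctypes.cdll.LoadLibrary` in place of the GCC build.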

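Note that, depending on your needs and the size of your last-level processor cache, you can write faster platform-specific C++ code using non-temporal stores (NT-stores). NT-stores tell the processor not to keep the array in its cache. Writing data with NT-stores is faster when filling huge arrays in RAM, but it can be slower if the array would have fit in the cache and you read it back afterwards, since the array then has to be reloaded from RAM. In your case (a 4 MiB array), it is unclear whether this would be faster.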

Answer 2

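Taking all the advice from the other answers and comments, I was able to do better than Numba:

Use cython + memoryviews (using ctypes adds some overhead somewhere)
Optimise the cpp implementation
Change the cython compiler to clang and set -march=skylake

I now have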

CPP took 9.407872100018721e-05
Numba took 9.336918499957392e-05
Cythonised took 9.22323310005595e-05

// array_copy.cpp

#include <algorithm>  // std::fill
#include "array.h"

const int n_rows = 1000;
const int n_cols = 1000;

void copy(const long *lefts,  float *outdatav) 
{   
    const float one = 1.;
    for (int i = 0; i < n_rows; i++) {
        const int l = lefts[i];
        float* start = outdatav + i*n_cols + l;
        std::fill(start, start + n_cols - l, one);
    }
}

# setup.py
import os

os.environ["CXX"] = "clang"
os.environ["CC"] = "clang"

from setuptools import setup, Extension
from Cython.Build import cythonize

ex = Extension(
    "array_copy_cython", 
    ["./cpp_ext/array_copy_ext.pyx", "./cpp_ext/array.cpp" ],
    include_dirs=["./cpp_ext"],
    extra_compile_args=["-march=skylake"],
    language="c++")
    

setup(
    name='copy',
    ext_modules=cythonize(ex),
    zip_safe=False,
)

# array_copy_ext.pyx

cdef extern from "array.h":
    void copy(const long* lefts,  float* outdatav)

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
def copy_array(const long[:] lefts, float[:,:] outdatav):
    copy(&lefts[0], &outdatav[0, 0])
    return outdatav
