Cython inline function with a numpy array as parameter

Question

Consider code like this:

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t] arr, int i):
    arr[i]+= 1

def test1(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        inc(arr, i)

def test2(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        arr[i] += 1

I used ipython to measure the speed of test1 and test2:

In [7]: timeit ttt.test1(arr)
100 loops, best of 3: 6.13 ms per loop

In [8]: timeit ttt.test2(arr)
100000 loops, best of 3: 9.79 us per loop

Is there a way to optimize test1? Why doesn't Cython inline this function as it was told to?

Update: what I actually need is multidimensional code like this:

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
    arr[i, j] += 1

def test1(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)


def test2(np.ndarray[np.int32_t, ndim=2] arr):    
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1

Timings:

In [7]: timeit ttt.test1(arr)
1 loops, best of 3: 647 ms per loop

In [8]: timeit ttt.test2(arr)
100 loops, best of 3: 2.07 ms per loop

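Explicit (manual) inlining gives a roughly 300x speedup. My actual function is quite large, though, so inlining it by hand makes the code much harder to maintain.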

UPDATE2：

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.float32_t, ndim=2] arr, int i, int j):
  arr[i, j]+= 1

def test1(np.ndarray[np.float32_t, ndim=2] arr):
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)


def test2(np.ndarray[np.float32_t, ndim=2] arr):    
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1    

cdef class FastPassingFloat2DArray(object):
    cdef float* data
    cdef int stride0, stride1 
    def __init__(self, np.ndarray[np.float32_t, ndim=2] arr):
        self.data = <float*>arr.data
        self.stride0 = arr.strides[0]/arr.dtype.itemsize
        self.stride1 = arr.strides[1]/arr.dtype.itemsize
    def __getitem__(self, tuple tp):
        cdef int i, j
        cdef float *pr, r
        i, j = tp        
        pr = (self.data + self.stride0*i + self.stride1*j)
        r = pr[0]
        return r
    def __setitem__(self, tuple tp, float value):
        cdef int i, j
        cdef float *pr, r
        i, j = tp        
        pr = (self.data + self.stride0*i + self.stride1*j)
        pr[0] = value        


cdef inline inc2(FastPassingFloat2DArray arr, int i, int j):
    arr[i, j]+= 1


def test3(np.ndarray[np.float32_t, ndim=2] arr):    
    cdef int i,j    
    cdef FastPassingFloat2DArray tmparr = FastPassingFloat2DArray(arr)
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc2(tmparr, i,j)

Timings:

In [4]: timeit ttt.test1(arr)
1 loops, best of 3: 623 ms per loop

In [5]: timeit ttt.test2(arr)
100 loops, best of 3: 2.29 ms per loop

In [6]: timeit ttt.test3(arr)
1 loops, best of 3: 201 ms per loop

Answer 1

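More than 3 years have passed since the question was posted, and there has been great progress in the meantime. On this code (update 2 of the question):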

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
    arr[i, j]+= 1

def test1(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

def test2(np.ndarray[np.int32_t, ndim=2] arr):    
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1

I get the following timings:

arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test1(arr)
%timeit test2(arr)
   1 loops, best of 3: 354 ms per loop
1000 loops, best of 3: 1.02 ms per loop

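So the problem is still reproducible after more than 3 years. Cython now has typed memoryviews, AFAIK introduced in Cython 0.16, so they were not available when the question was posted. With this: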

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline inc(int[:, ::1] tmv, int i, int j):
    tmv[i, j]+= 1

def test3(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j
    cdef int[:, ::1] tmv = arr
    for i in xrange(tmv.shape[0]):
        for j in xrange(tmv.shape[1]):
            inc(tmv, i, j)

def test4(np.ndarray[np.int32_t, ndim=2] arr):    
    cdef int i,j
    cdef int[:, ::1] tmv = arr
    for i in xrange(tmv.shape[0]):
        for j in xrange(tmv.shape[1]):
            tmv[i,j] += 1

With this I get:

arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 977 µs per loop
1000 loops, best of 3: 838 µs per loop

We are almost there, and already faster than the old-fashioned way! Now, the inc() function is eligible to be declared nogil, so let's declare it so! But oops:

Error compiling Cython file:
[...]

cdef inline inc(int[:, ::1] tmv, int i, int j) nogil:
    ^
[...]
Function with Python return type cannot be declared nogil

Aargh, I completely missed that the void return type was missing! Once again, but now with void:

cdef inline void inc(int[:, ::1] tmv, int i, int j) nogil:
    tmv[i, j]+= 1

And finally I get:

%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 843 µs per loop
1000 loops, best of 3: 853 µs per loop

As fast as manual inlining!

Now, just for fun, I tried Numba on this code:

import numpy as np
from numba import autojit, jit

@autojit
def inc(arr, i, j):
    arr[i, j] += 1

@autojit
def test5(arr):
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

I get:

arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test5(arr)
100 loops, best of 3: 4.03 ms per loop

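Even though it is 4.7x slower than Cython, most probably because the JIT compiler failed to inline inc(), I think it is AWESOME! All I needed to do was add @autojit, without cluttering the code with clumsy type declarations; an 88x speedup for next to nothing!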
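
The 4.7x and 88x figures can be checked against the timings printed earlier in this answer. The answer does not say which baselines it used, so the sketch below assumes the 853 µs test4 result for the Cython comparison and the 354 ms test1 result for the speedup; the variable names are made up for illustration.

```python
# Timings copied from the %timeit output in this answer, in milliseconds.
cython_ndarray_inline_ms = 354.0  # test1: inc() taking an np.ndarray argument
cython_memview_ms = 0.853         # test4: manual inlining with a typed memoryview
numba_autojit_ms = 4.03           # test5: Numba @autojit

# Assumed baselines: test4 for the slowdown, test1 for the speedup.
slowdown_vs_cython = numba_autojit_ms / cython_memview_ms
speedup_vs_ndarray_inline = cython_ndarray_inline_ms / numba_autojit_ms

print(f"Numba vs. Cython memoryview: {slowdown_vs_cython:.1f}x slower")
print(f"Numba vs. Cython ndarray inc(): {speedup_vs_ndarray_inline:.0f}x faster")
```

Under those assumptions the ratios round to the quoted 4.7x and 88x.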

I have tried other things with Numba, such as

@jit('void(i4[:],i4,i4)')
def inc(arr, i, j):
    arr[i, j] += 1

or nopython=True, but failed to get any further improvement.

Improving inlining is on the Numba developers' list; we only need to file more requests so that it gets higher priority. ;)

Answer 2

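You are passing the array to inc() as a Python object of type numpy.ndarray. Passing Python objects is expensive because of issues like reference counting, and it seems to prevent inlining. If you pass the array the C way, i.e. as a pointer, test1() becomes even faster than test2() on my machine: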

cimport numpy as np

cdef inline inc(int* arr, int i):
    arr[i] += 1

def test1(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        inc(<int*>arr.data, i)
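
A caveat with the pointer version: a bare `int*` carries no stride information, so indexing it as `ptr[i]` should only match `arr[i]` when the array is contiguous. The difference can be sketched in pure Python, using a stdlib memoryview as a stand-in for a strided NumPy view (the names `base` and `every_other` are hypothetical, chosen for illustration):

```python
from array import array

base = array("i", range(10))         # contiguous buffer: 0, 1, 2, ..., 9
every_other = memoryview(base)[::2]  # strided view of it: 0, 2, 4, 6, 8

# Stride-aware indexing, as done by buffer syntax, respects the step.
strided_value = every_other[1]

# Raw pointer-style indexing walks the underlying memory item by item and
# ignores the stride; this mimics ptr[1] on the view's data pointer.
raw_value = base[1]

print(every_other.c_contiguous, strided_value, raw_value)
```

Here the strided view yields 2 while the raw walk yields 1, so a pointer-based inc() would touch the wrong element on such a view.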

Answer 3

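The problem is that assigning a numpy array (or, equivalently, passing it in as a function argument) is not just a simple assignment, but a "buffer extraction" which fills a struct and pulls the stride and pointer information out into the local variables needed for fast indexing. If you're iterating over a moderate number of elements, this O(1) overhead is easily amortized over the loop, but that is certainly not the case for small functions.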
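
To make "stride and pointer information" concrete, here is a minimal pure-Python sketch of the metadata involved. It uses a stdlib memoryview rather than Cython's actual generated code, and `flat`, `view`, and `element_at` are hypothetical names for illustration only.

```python
from array import array

# Hypothetical stand-in for a 3x4 C-contiguous int array: a flat buffer of
# 12 C ints, exposed as 2-D through the buffer protocol.
flat = array("i", range(12))
view = memoryview(flat).cast("B").cast("i", shape=[3, 4])

# The kind of metadata a buffer acquisition copies into local variables.
shape = view.shape        # number of elements per dimension
strides = view.strides    # byte step per dimension
itemsize = view.itemsize  # bytes per element


def element_at(i, j):
    """Index the flat buffer using only the extracted stride information."""
    byte_offset = i * strides[0] + j * strides[1]
    return flat[byte_offset // itemsize]


print(shape, strides, element_at(1, 2))
```

Gathering `shape`, `strides`, and `itemsize` is the fixed per-call cost; once they are in locals, each element access is just the offset arithmetic in `element_at`. A function like inc() pays that fixed cost for a single access.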

Improving this is high on many people's wishlist, but it is a non-trivial change. See, for example, the discussion at http://groups.google.com/group/cython-users/browse_thread/thread/8fc8686315d7f3fe

Answer 1: score 18, posted 2014-07-05 22:17:25
Answer 2: score 7, posted 2011-01-09 22:05:16
Answer 3: score 7, accepted, posted 2011-01-20 01:17:49
