How can I make Cython much faster than Python (without Numpy) for adding two arrays together?

Question

I want to use Cython to decrease the time it takes to add two arrays together (element-wise) without using Numpy arrays. The fastest basic Python approach I found is a list comprehension, as follows:

def add_arrays(a,b):
    return [m + n for m,n in zip(a,b)]

My Cython approach is a little more complicated and looks as follows:

from array import array
from libc.stdlib cimport malloc
from cython cimport boundscheck,wraparound

@boundscheck(False)
@wraparound(False)
cpdef add_arrays_Cython(int[:] Aarr, int[:] Barr):
    cdef size_t i, I
    I = Aarr.shape[0]
    cdef int *Carr = <int *> malloc(640000 * sizeof(int))
    for i in range(I):
        Carr[i] = Aarr[i]+Barr[i]
    result_as_array  = array('i',[e for e in Carr[:640000]])
    return result_as_array

Note that I use @boundscheck(False) and @wraparound(False) to make it even faster. Also, I am dealing with a very large array (size 640000), and I found that the code crashes if I simply use cdef int Carr[640000], so I used malloc(), which solved that problem. Lastly, I return the data structure as a Python array of type integer.

To profile the code I ran the following:

import array
import time

a = array.array('i', range(640000))  # create integer array
b = a[:]  # copy of a, used as the second operand

T=time.clock()
for i in range(20): add_arrays(a,b) #Python list comprehension approach
print(time.clock() - T)

> 6.33 seconds

T=time.clock()
for i in range(20): add_arrays_Cython(a,b) #Cython approach
print(time.clock() - T)

> 4.54 seconds

Evidently, the Cython-based approach gives a speed-up of about 30%. I expected the speed-up to be closer to an order of magnitude or even more (as it is for Numpy).

What can I do to speed up the Cython code further? Are there any obvious bottlenecks in my code? I am a beginner with Cython, so I may be misunderstanding something.
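
A note for anyone reproducing these timings: time.clock() was removed in Python 3.8, so the profiling snippet above does not run on current interpreters. The following is a minimal sketch of an equivalent harness, assuming timeit is an acceptable substitute. The helper name time_call is hypothetical, and only the pure-Python baseline is timed because the Cython function has to be compiled separately. timeit measures wall-clock time, so the numbers are not directly comparable to the figures quoted above.

```python
import timeit
from array import array


def add_arrays(a, b):
    """Pure-Python baseline from the question: element-wise sum as a list."""
    return [m + n for m, n in zip(a, b)]


def time_call(func, *args, repeats=20):
    """Hypothetical helper: total wall-clock seconds for `repeats` calls."""
    return timeit.timeit(lambda: func(*args), number=repeats)


if __name__ == "__main__":
    a = array('i', range(640000))  # same input size as in the question
    b = a[:]                       # copy of a, used as the second operand
    print(f"list comprehension: {time_call(add_arrays, a, b):.2f} s")
```

Once the extension module is built, passing add_arrays_Cython to time_call in the same way should give a like-for-like comparison.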

Answer 1

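The biggest bottleneck is the conversion of the result pointer back to an array. The expression [e for e in Carr[:640000]] builds a Python list holding one Python int object per element, which array() then has to unpack again.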

Here's an optimized version:

from cython cimport boundscheck,wraparound
from cython cimport view

@boundscheck(False)
@wraparound(False)
cpdef add_arrays_Cython(int[:] Aarr, int[:] Barr):
    cdef size_t i, I
    I = Aarr.shape[0]
    result_as_array = view.array(shape=(I,), itemsize=sizeof(int), format='i')
    cdef int[:] Carr = result_as_array
    for i in range(I):
        Carr[i] = Aarr[i]+Barr[i]
    return result_as_array

A few things to note here: instead of malloc'ing a temporary buffer and then copying the result to an array, I create a cython.view.array and bind it to an int[:] typed memoryview. This gives me the raw speed of pointer access and also avoids the unnecessary copying. I also return the Cython object directly, without converting it to a Python object first. In total, this gives me a 70x speed-up compared to your original Cython implementation.

Converting the view object to a list proved tricky: if you simply change the return statement to return list(result_as_array), the code becomes about 10x slower than your initial implementation. But if you add an extra layer of wrapping, like so: return list(memoryview(result_as_array)), the function is about 5x faster than your version. So again, the main overhead is in going from the fast native object to a generic Python one, and this should always be avoided if you need fast code.
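
The two conversion routes can be tried without compiling anything. The sketch below assumes a stdlib array.array as a stand-in for the cython.view.array returned above, so the relative timings it prints will not match the 10x and 5x figures quoted here. The function names are hypothetical. It only illustrates that both routes yield the same list of Python ints and lets you measure the cost of the conversion on its own.

```python
import timeit
from array import array

# Stand-in for the buffer returned by the Cython function (assumption:
# a stdlib array is used here instead of cython.view.array).
result = array('i', range(640000))


def as_list_direct(buf):
    # Iterates the buffer object itself; every element becomes a Python int.
    return list(buf)


def as_list_via_memoryview(buf):
    # Wraps the buffer in a built-in memoryview first, then iterates that.
    return list(memoryview(buf))


# Both routes must produce identical Python lists.
assert as_list_direct(result) == as_list_via_memoryview(result)

for fn in (as_list_direct, as_list_via_memoryview):
    seconds = timeit.timeit(lambda: fn(result), number=5)
    print(f"{fn.__name__}: {seconds:.3f} s for 5 conversions")
```

Whichever route is faster on a given object, the conversion creates one Python int per element, so it is best done once at the boundary where a list is really needed.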

For comparison I ran the code with numpy. The numpy version performed exactly as fast as my Cython version. This suggests that the C compiler was able to automatically vectorize the pairwise summation loop in my code.

Side note: you need to call free() on malloc()'d pointers, otherwise you leak memory.

Solution 1
2 Accepted 2020-03-31 12:09:33
