Why is Cython slower than vectorized NumPy?

Question

Consider the following Cython code:

cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

def test_numpyvec(a, b):
    a += b

def gendata(nb=40000000):
    a = np.random.random(nb)
    b = np.random.random(nb)
    return a, b

Running it in the interpreter yields (after a few runs to warm up the cache):

In [14]: %timeit -n 100 test_memoryview(a, b)
100 loops, best of 3: 148 ms per loop

In [15]: %timeit -n 100 test_numpy(a, b)
100 loops, best of 3: 159 ms per loop

In [16]: %timeit -n 100 test_numpyvec(a, b)
100 loops, best of 3: 124 ms per loop

# See Answer 1 below:
In [17]: %timeit -n 100 test_raw_pointers(a, b)
100 loops, best of 3: 129 ms per loop

I tried different dataset sizes, and the vectorized NumPy function consistently ran faster than the compiled Cython code, whereas I expected Cython to be on par with vectorized NumPy in terms of performance.
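
For readers who want to reproduce the vectorized baseline outside IPython, here is a minimal sketch using the standard `timeit` module. The helper names `add_vectorized` and `best_time` are made up for this illustration (`add_vectorized` mirrors `test_numpyvec` above), and the default array size is reduced so the script runs quickly; pass `nb=40000000` to match the question.

```python
import timeit

import numpy as np


def gendata(nb=1_000_000):
    """Same kind of data as in the question, but smaller by default."""
    rng = np.random.default_rng(0)
    return rng.random(nb), rng.random(nb)


def add_vectorized(a, b):
    """Counterpart of test_numpyvec: in-place, vectorized add."""
    a += b


def best_time(func, a, b, number=10, repeat=3):
    """Best average seconds per call over `repeat` rounds of `number` calls."""
    timer = timeit.Timer(lambda: func(a, b))
    return min(timer.repeat(repeat=repeat, number=number)) / number


a, b = gendata()
seconds = best_time(add_vectorized, a, b)
print(f"add_vectorized: {seconds * 1e3:.3f} ms per call")
```

The compiled Cython functions can be passed to `best_time` in the same way once the extension module is imported, which keeps the measurement method identical across all variants.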

Did I forget an optimization in my Cython code? Does NumPy use something (BLAS?) to make such simple operations run faster? Can I improve the performance of this code?

Update: The raw pointer version seems to be on par with NumPy, so apparently there is some overhead in using memoryview or NumPy indexing.
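
One way to put the timings above in context is a back-of-the-envelope estimate of how much memory each version moves per second. This is only a rough sketch: it assumes each call streams `a` in, `b` in, and `a` back out exactly once, and it ignores caches and any extra traffic the hardware generates. It does not explain the gap between the versions by itself; it only shows the scale of data involved.

```python
N = 40_000_000          # elements per array (gendata default in the question)
BYTES_PER_DOUBLE = 8

# a[i] += b[i] reads a, reads b, and writes a: three array-sized streams.
bytes_moved = 3 * N * BYTES_PER_DOUBLE

# Timings in milliseconds, copied from the %timeit output in the question.
timings_ms = {
    "test_memoryview": 148,
    "test_numpy": 159,
    "test_numpyvec": 124,
    "test_raw_pointers": 129,
}

# Effective throughput in GB/s (1 GB = 1e9 bytes) for each version.
bandwidth_gb_s = {
    name: bytes_moved / (ms / 1000) / 1e9 for name, ms in timings_ms.items()
}

for name, gbs in bandwidth_gb_s.items():
    print(f"{name:18s} {gbs:5.2f} GB/s")
```

All four versions land between roughly 6 and 8 GB/s under these assumptions, which suggests that memory traffic may account for a large share of the run time of every variant.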

Answer 1

Another option is to use raw pointers (and the global directives to avoid repeating @cython...):

#cython: wraparound=False
#cython: boundscheck=False
#cython: nonecheck=False

#...

cdef ctest_raw_pointers(int n, double *a, double *b):
    cdef int i
    for i in range(n):
        a[i] += b[i]

def test_raw_pointers(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    ctest_raw_pointers(a.shape[0], &a[0], &b[0])

Answer 2

On my machine the difference isn't as large, but I can nearly eliminate it by changing the NumPy and memoryview functions like this:

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i, n=a.shape[0]
    for i in range(n):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double] a, np.ndarray[double] b):
    cdef int i, n=a.shape[0]
    for i in range(n):
        a[i] += b[i]

and then, when I compile the C output from Cython, I use the flags -O3 and -march=native. This seems to indicate that the difference in timings comes from the use of different compiler optimizations.

I use the 64-bit version of MinGW and NumPy 1.8.1. Your results will probably vary depending on your package versions, hardware, platform, and compiler.

If you are using the IPython notebook's Cython magic, you can force a rebuild with the additional compiler flags by replacing %%cython with %%cython -f -c=-O3 -c=-march=native

If you are using a standard setup.py for your Cython module, you can specify the extra_compile_args argument when creating the Extension object that you pass to distutils.core.setup (or setuptools.setup).
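
As a rough sketch of the setup.py route, the snippet below builds an `Extension` carrying the two flags from this answer. The module name `fastadd` and the file `fastadd.pyx` are hypothetical placeholders, and the flags are written in GCC/Clang/MinGW style; other compilers spell them differently. The actual `setup` call is left as a comment so the snippet can be run without triggering a build.

```python
from setuptools import Extension

# Hypothetical module and file names, for illustration only.
MODULE_NAME = "fastadd"
SOURCE_FILE = "fastadd.pyx"


def make_extension():
    """Describe the extension with the optimization flags from the answer."""
    return Extension(
        MODULE_NAME,
        sources=[SOURCE_FILE],
        # GCC/Clang/MinGW-style flags; MSVC uses different spellings.
        extra_compile_args=["-O3", "-march=native"],
    )


ext = make_extension()
print(ext.name, ext.extra_compile_args)

# In a real setup.py this would be followed by something like:
#     from Cython.Build import cythonize
#     from setuptools import setup
#     setup(ext_modules=cythonize([ext]))
# Modules that `cimport numpy` usually also need
# include_dirs=[numpy.get_include()] on the Extension.
```

Note that -march=native tunes the binary for the build machine, so an extension built this way may not run on older CPUs.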

Note: I removed the ndim=1 argument when specifying the types for the NumPy arrays because it isn't necessary. That value defaults to 1 anyway.

Answer 3

A change that slightly increases the speed is to specify the stride, which declares the arrays as contiguous:

def test_memoryview_inorder(double[::1] a, double[::1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]
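
The `double[::1]` declaration promises Cython that consecutive elements sit next to each other in memory, so an array that does not satisfy this is rejected with an error when the function is called. The following sketch, in plain NumPy, shows how to check which arrays qualify and how to obtain a contiguous copy; the variable names are only for illustration.

```python
import numpy as np

a = np.zeros(10)      # freshly allocated 1-D float64 array
view = a[::2]         # every other element, sharing a's memory

# Strides are in bytes: 8 for adjacent doubles, 16 when skipping one.
print(a.strides, a.flags["C_CONTIGUOUS"])
print(view.strides, view.flags["C_CONTIGUOUS"])

# A copy made with ascontiguousarray is unit-stride again.
packed = np.ascontiguousarray(view)
print(packed.strides, packed.flags["C_CONTIGUOUS"])
```

Because `packed` is a copy, in-place updates made through it do not reach the original array, so this workaround only fits cases where the result is read from the copy.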

3 answers

Answer 1: score 10, accepted, 2014-06-19 13:58:39

Answer 2: score 3, 2014-06-21 16:32:51

Answer 3: score 1, 2014-06-20 20:30:46