
Why is Cython slower than vectorized NumPy?

Consider the following Cython code:

cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

def test_numpyvec(a, b):
    a += b

def gendata(nb=40000000):
    a = np.random.random(nb)
    b = np.random.random(nb)
    return a, b

Running it in the interpreter yields (after a few runs to warm up the cache):

In [14]: %timeit -n 100 test_memoryview(a, b)
100 loops, best of 3: 148 ms per loop

In [15]: %timeit -n 100 test_numpy(a, b)
100 loops, best of 3: 159 ms per loop

In [16]: %timeit -n 100 test_numpyvec(a, b)
100 loops, best of 3: 124 ms per loop

# See the answer below:
In [17]: %timeit -n 100 test_raw_pointers(a, b)
100 loops, best of 3: 129 ms per loop

I tried different dataset sizes, and the vectorized NumPy function consistently ran faster than the compiled Cython code, whereas I expected Cython to be at least on par with vectorized NumPy in terms of performance.

Did I forget an optimization in my Cython code? Does NumPy use something (BLAS?) in order to make such simple operations run faster? Can I improve the performance of this code?

Update: The raw pointer version seems to be on par with NumPy, so apparently there is some overhead in the memoryview or NumPy buffer indexing.

Another option is to use raw pointers (and the global directives to avoid repeating @cython...):

#cython: wraparound=False
#cython: boundscheck=False
#cython: nonecheck=False

#...

cdef ctest_raw_pointers(int n, double *a, double *b):
    cdef int i
    for i in range(n):
        a[i] += b[i]

def test_raw_pointers(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    ctest_raw_pointers(a.shape[0], &a[0], &b[0])
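
Note that the raw pointer version implicitly assumes both arrays are C-contiguous: &a[0] hands ctest_raw_pointers a bare pointer, and the C loop then advances it one double at a time, so a strided view such as a[::2] would be read incorrectly. A hypothetical checked wrapper (a sketch only, meant to live in the same module as ctest_raw_pointers above) could guard against that:

def test_raw_pointers_checked(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    # Sketch only: reject strided views, since the raw pointer loop assumes unit stride.
    if not (a.flags['C_CONTIGUOUS'] and b.flags['C_CONTIGUOUS']):
        raise ValueError("test_raw_pointers requires C-contiguous arrays")
    ctest_raw_pointers(a.shape[0], &a[0], &b[0])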

On my machine the difference isn't as large, but I can nearly eliminate it by changing the numpy and memoryview functions like this:

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i, n=a.shape[0]
    for i in range(n):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double] a, np.ndarray[double] b):
    cdef int i, n=a.shape[0]
    for i in range(n):
        a[i] += b[i]

and then, when I compile the C output from Cython, I use the flags -O3 and -march=native. This seems to indicate that the difference in timings comes from the use of different compiler optimizations.

I use the 64-bit version of MinGW and NumPy 1.8.1. Your results will probably vary depending on your package versions, hardware, platform, and compiler.

If you are using the IPython notebook's Cython magic, you can force a rebuild with the additional compiler flags by replacing %%cython with %%cython -f -c=-O3 -c=-march=native.
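
A minimal sketch of what such a notebook cell might look like (the body is just the memoryview function from above, recompiled with the extra flags):

%%cython -f -c=-O3 -c=-march=native
# -f forces a rebuild so the new C compiler flags actually take effect.
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i, n = a.shape[0]
    for i in range(n):
        a[i] += b[i]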

If you are using a standard setup.py for your Cython module, you can specify the extra_compile_args argument when creating the Extension object that you pass to distutils.setup.
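
For example, a minimal setup.py along these lines would pass the flags through (the module name cytest is a placeholder, and -march=native assumes a GCC-compatible compiler such as MinGW):

# setup.py -- sketch of passing extra C compiler flags to a Cython extension.
import numpy as np
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

extensions = [
    Extension(
        "cytest",                      # placeholder module name
        ["cytest.pyx"],                # placeholder source file
        include_dirs=[np.get_include()],
        extra_compile_args=["-O3", "-march=native"],
    )
]

setup(ext_modules=cythonize(extensions))

Running python setup.py build_ext --inplace then rebuilds the module with those optimizations.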

Note: I removed the ndim=1 flag when specifying the types for the NumPy arrays because it isn't necessary. That value defaults to 1 anyway.

A change that slightly increases the speed is to specify the stride, declaring the arrays as C-contiguous:

def test_memoryview_inorder(double[::1] a, double[::1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]
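
Declaring the type as double[::1] tells Cython the data is C-contiguous (unit stride), so the generated indexing code no longer needs a stride multiplication on every access; the trade-off is that the function only accepts contiguous arrays. A quick usage sketch:

a, b = gendata()
test_memoryview_inorder(a, b)            # fine: gendata() returns C-contiguous arrays
test_memoryview_inorder(a[::2], b[::2])  # ValueError: the strided slice is not contiguous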
