
Cython prange not much faster than single thread

I just wrote a trivial program to test how Cython's prange performs, and here is the code:

from cython.parallel import prange
import numpy as np

def func(int r, int c):
  cdef:
    double[:,:] a = np.arange(r*c, dtype=np.double).reshape(r,c)
    double total = 0
    int i, j

  for i in prange(r, nogil=True, schedule='static', chunksize=1):
    for j in range(c):
      total += a[i,j]

  return total

On a MacBook Pro with OMP_NUM_THREADS=3, the above code takes almost 18 seconds for (r, c) = (10000, 100000), while with a single thread it takes about 21 seconds.

Why is there so little performance gain? Am I using prange correctly?

Have you timed how long it takes just to allocate a? A 10000 x 100000 float64 array takes up 8 GB of memory.

a = np.ones((10000, 100000), np.double)

takes over six seconds on my laptop with 16 GB of RAM. If you don't have 8 GB free, you'll hit swap and it will take a lot longer. Since func spends almost all of its time just allocating a, parallelizing the outer for loop can therefore only gain you a small fractional improvement in total runtime.
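You can check this on your own machine by timing the allocation on its own. A minimal sketch using only the standard library (the shape matches the question):

import time
import numpy as np

r, c = 10000, 100000

t0 = time.perf_counter()
a = np.arange(r * c, dtype=np.double).reshape(r, c)   # the same allocation func performs
print("allocation took %.2f s" % (time.perf_counter() - t0))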

To demonstrate this, I modified your function to accept a as an input. In tmp.pyx:

#cython: boundscheck=False, wraparound=False, initializedcheck=False

from cython.parallel cimport prange

def serial(double[:, :] a):
    # Single-threaded sum over all elements of a.
    cdef:
        double total = 0
        int i, j
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            total += a[i, j]
    return total

def parallel(double[:, :] a):
    # Same sum, but the outer loop is split across OpenMP threads;
    # Cython treats total as a reduction variable inside prange.
    cdef:
        double total = 0
        int i, j
    for i in prange(a.shape[0], nogil=True, schedule='static', chunksize=1):
        for j in range(a.shape[1]):
            total += a[i, j]
    return total
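Note that prange needs OpenMP, so tmp.pyx must be compiled with the appropriate flags. A minimal setup.py sketch, assuming a GCC-compatible compiler (Apple clang needs libomp installed separately):

# setup.py -- minimal build script for tmp.pyx
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "tmp",
    sources=["tmp.pyx"],
    extra_compile_args=["-fopenmp"],   # enable OpenMP so prange actually runs in parallel
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize(ext))

Build it in place with python setup.py build_ext --inplace, then import tmp as below.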

For example:

In [1]: import numpy as np

In [2]: import tmp

In [3]: r, c = 10000, 100000

In [4]: a = np.random.randn(r, c)   # this takes ~6.75 sec

In [5]: %timeit tmp.serial(a)
1 loops, best of 3: 1.25 s per loop

In [6]: %timeit tmp.parallel(a)
1 loops, best of 3: 450 ms per loop

Parallelizing the function gave about a 2.8x speed-up* on my laptop with 4 cores, but this is only a small fraction of the time taken to allocate a.

The lesson here is to always profile your code to understand where it spends most of its time before you dive into optimizations.
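As a minimal sketch of that kind of check, using the standard library's cProfile (the run helper is just for illustration):

import cProfile
import numpy as np
import tmp

def run(r=10000, c=100000):
    a = np.arange(r * c, dtype=np.double).reshape(r, c)   # allocation
    return tmp.parallel(a)                                # summation

cProfile.run("run()", sort="cumulative")   # the allocation line should dominate the output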


* You could do a little better by passing larger chunks of a to each worker thread, e.g. by increasing chunksize or using schedule='guided'.
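For instance, the prange line in parallel could be changed as follows (a sketch of the same loop with a different schedule, not a measured result):

    # hypothetical variant: let OpenMP hand out large, shrinking chunks
    for i in prange(a.shape[0], nogil=True, schedule='guided'):
        for j in range(a.shape[1]):
            total += a[i, j]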
