I just wrote a trivial program to test how cython
's prange
performs, and here is the code:
from cython.parallel import prange
import numpy as np
def func(int r, int c):
cdef:
double[:,:] a = np.arange(r*c, dtype=np.double).reshape(r,c)
double total = 0
int i, j
for i in prange(r, nogil=True, schedule='static', chunksize=1):
for j in range(c):
total += a[i,j]
return total
On Mac Book pro, with OMP_NUM_THREADS=3
, the above code takes almost 18 sec for (r,c) = (10000, 100000)
, and with single thread, it takes about 21 sec.
Why there is so little performance boost? Am I using this prange
correctly?
Have you timed how long it takes just to allocate a
? A 10000 x 100000 float64 array takes up 8GB of memory.
a = np.ones((10000, 100000), np.double)
takes over six seconds on my laptop with 16GB of RAM. If you don't have 8GB free then you'll hit the swap and it will take a lot longer. Since func
spends almost all of its time just allocating a
, parallelizing your outer for
loop can therefore only gain you a small fractional improvement on the total runtime.
To demonstrate this, I have modified your function to accept a
as an input. In tmp.pyx
:
#cython: boundscheck=False, wraparound=False, initializedcheck=False
from cython.parallel cimport prange
def serial(double[:, :] a):
cdef:
double total = 0
int i, j
for i in range(a.shape[0]):
for j in range(a.shape[1]):
total += a[i, j]
return total
def parallel(double[:, :] a):
cdef:
double total = 0
int i, j
for i in prange(a.shape[0], nogil=True, schedule='static', chunksize=1):
for j in range(a.shape[1]):
total += a[i, j]
return total
For example:
In [1]: import tmp
In [2]: r, c = 10000, 100000
In [3]: a = np.random.randn(r, c) # this takes ~6.75 sec
In [4]: %timeit tmp.serial(a)
1 loops, best of 3: 1.25 s per loop
In [5]: %timeit tmp.parallel(a)
1 loops, best of 3: 450 ms per loop
Parallelizing the function gave about a 2.8x speed-up* on my laptop with 4 cores, but this is only a small fraction of the time taken to allocate a
.
The lesson here is to always profile your code to understand where it spends its most of its time before you dive into optimizations.
* You could do a little better by passing larger chunks of a
to each worker process, eg by increasing chunksize
or using schedule='guided'
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.