Parallelize Python loop over numpy.searchsorted using Cython

I've coded a function in Cython containing the following loop. Each row of array A1 is binary searched for all values in array A2, so each loop iteration returns a 2D array of index values. Arrays A1 and A2 enter as function arguments and are properly typed.

The array C is pre-allocated at the highest indentation level, as required in Cython.

I simplified things a little for this question.

...
cdef np.ndarray[DTYPEint_t, ndim=3] C = np.zeros([N,M,M], dtype=DTYPEint)

for j in range(0, N):
    C[j, :, :] = np.searchsorted(A1[j, :], A2, side='left')

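For context, a minimal sketch of what the complete serial function might look like; the function name, the dtype definitions, and the assumption that A2 is a square (M, M) array are my own, since the question only shows a fragment:

import numpy as np
cimport numpy as np

DTYPEint = np.int64
ctypedef np.int64_t DTYPEint_t

def match_rows(np.ndarray[np.float64_t, ndim=2] A1,
               np.ndarray[np.float64_t, ndim=2] A2):
    cdef Py_ssize_t N = A1.shape[0]
    cdef Py_ssize_t M = A2.shape[0]
    cdef np.ndarray[DTYPEint_t, ndim=3] C = np.zeros([N, M, M], dtype=DTYPEint)
    cdef Py_ssize_t j

    for j in range(N):
        # each row of A1 is searched for every value in A2,
        # giving an (M, M) block of insertion indices
        C[j, :, :] = np.searchsorted(A1[j, :], A2, side='left')

    return C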
All is fine so far; things compile and run as expected. However, to gain even more speed I want to parallelize the j-loop. My first attempt was simply writing

for j in prange(0, N, nogil=True):
    C[j, :, :] = np.searchsorted(A1[j, :], A2, side='left')

I tried many coding variations, such as putting things in a separate nogil function, assigning the result to an intermediate array, and writing a nested loop to avoid the assignment to the sliced part of C.

The errors are usually of the form "Accessing Python attribute not allowed without gil".

I can't get it to work. Any suggestions on how I can do this?

EDIT:

This is my setup.py

try:
    from setuptools import setup
    from setuptools import Extension
except ImportError:
    from distutils.core import setup
    from distutils.extension import Extension


from Cython.Build import cythonize

import numpy

extensions = [Extension("matchOnDistanceVectors",
                    sources=["matchOnDistanceVectors.pyx"],
                    extra_compile_args=["/openmp", "/O2"],
                    extra_link_args=[]
                   )]


setup(
    ext_modules=cythonize(extensions),
    include_dirs=[numpy.get_include()]
)

I'm on Windows 7 compiling with MSVC. I did specify the /openmp flag, and my arrays are of size 200*200, so everything seems to be in order...

I believe that searchsorted releases the GIL itself (see https://github.com/numpy/numpy/blob/e2805398f9a63b825f4a2aab22e9f169ff65aae9/numpy/core/src/multiarray/item_selection.c, line 1664, "NPY_BEGIN_THREADS_DEF").

Therefore, you can do

for j in prange(0, N, nogil=True):
    with gil:
        C[j, :, :] = np.searchsorted(A1[j, :], A2, side='left')

That temporarily claims back the GIL to do the necessary work on Python objects (which is hopefully quick), and then it should be released again inside searchsorted, allowing the loop to run largely in parallel.
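Putting the pieces together, a self-contained sketch of that pattern might look like the following (the function name and argument types are illustrative, not taken from the original code, and A2 is assumed square as in the serial sketch above):

import numpy as np
cimport numpy as np
from cython.parallel import prange

def match_rows_parallel(np.ndarray[np.float64_t, ndim=2] A1,
                        np.ndarray[np.float64_t, ndim=2] A2):
    cdef Py_ssize_t N = A1.shape[0]
    cdef Py_ssize_t M = A2.shape[0]
    cdef np.ndarray[np.int64_t, ndim=3] C = np.zeros([N, M, M], dtype=np.int64)
    cdef Py_ssize_t j

    for j in prange(N, nogil=True):
        with gil:
            # the GIL is held only for the Python-level call; searchsorted
            # should release it again internally while it does the search
            C[j, :, :] = np.searchsorted(A1[j, :], A2, side='left')

    return C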


Update: I did a quick test of this (A1.shape==(105,100), A2.shape==(302,302); the numbers are chosen pretty arbitrarily). For 10 repeats the serial version took 4.5 seconds and the parallel version took 1.4 seconds (the test was run on a 4-core CPU). You don't get the full 4x speed-up, but you get close.

This was compiled as described in the documentation. If you aren't seeing a speed-up, I suspect it could be any of: 1) your arrays are small enough that the function-call overhead and numpy's checking of types and sizes dominates; 2) you aren't compiling it with OpenMP enabled; or 3) your compiler doesn't support OpenMP.
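For reference, on gcc or clang the OpenMP flags are spelled differently from the MSVC /openmp used in the setup.py above; a sketch of the Extension definition under that assumption (everything else in setup.py stays the same):

extensions = [Extension("matchOnDistanceVectors",
                        sources=["matchOnDistanceVectors.pyx"],
                        extra_compile_args=["-fopenmp", "-O2"],
                        extra_link_args=["-fopenmp"]
                        )]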

You have a bit of a catch-22. You need the GIL to call numpy.searchsorted, but the GIL prevents any kind of parallel processing. Your best bet is to write your own nogil version of searchsorted:

cdef Py_ssize_t mySearchSorted(double[:] array, double target) nogil:
    # binary search implementation

for j in prange(0, N, nogil=True):
    for k in range(A2.shape[0]):
        for L in range(A2.shape[1]):
            C[j, k, L] = mySearchSorted(A1[j, :], A2[k, L])

numpy.searchsorted also has a non-trivial amount of overhead per call, so if N is large it makes sense to use your own searchsorted just to reduce that overhead.
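A possible body for that nogil helper is a standard leftmost binary search, returning the same insertion point as np.searchsorted(..., side='left'); the decorators and the Py_ssize_t return type are my own choices, not from the original answer. Note that for the A1[j, :] slice inside the nogil loop to compile, A1, A2 and C need to be typed memoryviews (e.g. double[:, :]) rather than ndarray buffers:

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef Py_ssize_t mySearchSorted(double[:] array, double target) nogil:
    # returns the smallest index i such that array[i] >= target,
    # assuming array is sorted in ascending order
    cdef Py_ssize_t lo = 0
    cdef Py_ssize_t hi = array.shape[0]
    cdef Py_ssize_t mid
    while lo < hi:
        mid = (lo + hi) // 2
        if array[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo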
