
Parallelize in Cython without GIL

I'm trying to compute some columns of a numpy array, operating on Python objects (numpy arrays) in a for loop using a cdef function.

I would like to do it in parallel, but I'm not sure how.

Here is a toy example: a def function calls a cdef function in a for loop using prange, which is not allowed because np.ndarray is a Python object. In my real problem, one matrix and one vector are the arguments of the cdef function, and some numpy matrix operations are performed, like np.linalg.pinv() (which I guess is actually the bottleneck).

%%cython
import numpy as np
cimport numpy as np
from cython.parallel import prange

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

def transpose_example(np.ndarray[DTYPE_t, ndim=2] data):
    """
    Transposes a matrix. Each row is handled independently and in parallel.
    """

    cdef Py_ssize_t n = data.shape[0]
    cdef Py_ssize_t t = data.shape[1]

    cdef np.ndarray[DTYPE_t, ndim = 2] results = np.zeros((t, n))

    cdef Py_ssize_t i

    for i in prange(n, nogil=True):
        results[:, i] = transpose_vector(data[i, :])

    return results

cdef transpose_vector(np.ndarray[DTYPE_t, ndim=1] vector):
    """
    transposes a np vector
    """
    return vector.transpose()

a = np.random.rand(100, 20)
transpose_example(a)

outputs

Converting to Python object not allowed without gil

What would be the best way to do this in parallel?

You can pass typed memoryview slices (cdef transpose_vector(DTYPE_t[:] vector)) around without the GIL - it's one of the key advantages of the newer typed memoryview syntax over np.ndarray.

However,

  • You can't call Numpy member functions (like transpose) on memoryviews, unless you cast back to a Numpy array (np.asarray(vector)). This requires the GIL.
  • Calling any kind of Python function (e.g. transpose) is going to require the GIL. This can be done inside a with gil: block, but when that block covers almost your entire loop it becomes pretty pointless.
  • You don't specify a return type for transpose_vector, so it defaults to object, which requires the GIL. You could specify a Cython return type, but I suspect even returning a memoryview slice may require some reference counting somewhere.
  • Be careful not to have multiple threads overwriting the same data in your passed memoryview slice.

In summary: use memoryview slices, but bear in mind you're quite limited in what you can do without the GIL. Your current example just isn't parallelizable as written (but this may be mostly because it's a toy example); a minimal sketch of what is possible follows below.
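As a minimal sketch of that route (the name transpose_parallel is made up for this example, and the OpenMP flags shown are one common way to enable prange in a notebook), the transpose can be done element-by-element so that the prange body never touches a Python object:

%%cython --compile-args=-fopenmp --link-args=-fopenmp
import numpy as np
cimport cython
from cython.parallel import prange

ctypedef double DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def transpose_parallel(DTYPE_t[:, :] data):
    """Element-wise transpose over typed memoryviews, parallel over rows."""
    cdef Py_ssize_t n = data.shape[0]
    cdef Py_ssize_t t = data.shape[1]
    results = np.zeros((t, n))
    cdef DTYPE_t[:, :] res = results   # typed view into the output array
    cdef Py_ssize_t i, j
    for i in prange(n, nogil=True):    # plain C indexing, so no GIL is needed
        for j in range(t):
            res[j, i] = data[i, j]     # each thread writes its own output column
    return results

Each thread writes only its own column of the output, so no two threads ever touch the same element - the non-overlapping write pattern the last bullet point above asks for.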

Q : "What would be the best way to do this in parallel?"“并行执行此操作的最佳方法是什么?”
+ +
" I have used intentionally np.transpose() to show that I have to use a python object. " 我故意使用np.transpose()来表明我必须使用 python 对象。

Let me start with freely paraphrasing the fabulous Henry FORD's maxim: the least defective part of an automobile is the very one that is not there at all - it can never get damaged.

Those who know how the numpy's internal array-object representation works can be sure:
it takes almost zero time
and
it requires almost zero memory-I/O ( at least for the last decade or so it did )


WHY ?

Numpy is smart.
It does not move any data for this at all. It just adapts the indexing, at a cost of about 15 [us].
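A quick REPL check (mine, not part of the original answer) shows what "adapting the indexing" means - the transposed array is a view over the very same buffer, with just the strides swapped:

>>> import numpy as np
>>> a = np.arange(6, dtype=np.float64).reshape(2, 3)
>>> b = a.transpose()
>>> np.shares_memory(a, b)   # the "transposed" array reuses the same buffer
True
>>> a.strides, b.strides     # only the stride order differs
((24, 8), (8, 24))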


ANSWER :

The Best Ever way to do the np.transpose() does not need any parallelisation at all.

Any attempt to do so will result in far poorer performance, due to many artificially enforced, useless memory-I/Os that the native np.transpose() never performs - it just swaps the indexing-scheme, without moving the heaps of cell-data, which all rest in their original places (incl. keeping any and all cache-coherences valid - so any next access takes 0.5 ~ 5 [ns] from cache, instead of paying again and again 150-350 [ns] for moving cell-data from/to physical RAM-locations and devastating the cache-lines).


THE PROOF : ( aClk below is a microsecond-resolution stopwatch; aClk.stop() returns the elapsed time in [us] )

______1x THE PROBLEM SIZE 100 x 20 x 8 [B] ( 16 kB RAM).transpose() ~ 17 [us]

>>> r =   100; c =   20; a = np.arange(r*c).reshape( (r,c) );a.itemsize*a.size/1E6
0.016
>>> aClk.start(); _ = a.transpose(); aClk.stop()  ####  16   [kB] RAM-footprint
17                                                ####  17   [us] !!! ZERO-COPY

____100x THE PROBLEM SIZE 1000 x 200 x 8 [B] (1.6 MB RAM).transpose() ~ 17 [us]

>>> r =  1000; c =  200; a = np.arange(r*c).reshape( (r,c) );a.itemsize*a.size/1E6
1.6
>>> aClk.start(); _ = a.transpose(); aClk.stop()  ####   1.6 [MB] RAM-footprint
17                                                ####  17   [us] !!! ZERO-COPY

__10000x THE PROBLEM SIZE 10000 x 2000 x 8 [B] (160 MB RAM).transpose() ~ 16 [us]

>>> r = 10000; c = 2000; a = np.arange(r*c).reshape( (r,c) );a.itemsize*a.size/1E6
160.0
>>> aClk.start(); _ = a.transpose(); aClk.stop()  #### 160.0 [MB] RAM-footprint
16                                                ####  16   [us] !!! ZERO-COPY

Moving or ALAP-allocating & copying that many RAM-stored megabytes of cell-data, as any prange-d or whatever other code would have to do, will take ages, not the smart ~ 16 [us] that the smart numpy-design takes.
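To make that cost visible, one can force the copy that any parallelised rewrite would have to perform - a comparison sketch of mine, not from the original answer (absolute timings depend on the machine):

import numpy as np
import timeit

a = np.random.rand(10000, 2000)            # ~ 160 MB of float64 cell-data

t_view = timeit.timeit(lambda: a.transpose(), number=1000) / 1000
t_copy = timeit.timeit(lambda: np.ascontiguousarray(a.T), number=10) / 10

print(t_view)   # zero-copy view: on the order of microseconds per call
print(t_copy)   # real copy: all 160 MB move through RAM, orders of magnitude slower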


>>> a.flags
  C_CONTIGUOUS    : True <-------------------------ORIGINAL [indirect] RAM-indexing
  F_CONTIGUOUS    : False
  OWNDATA         : False
  WRITEABLE       : True
  ALIGNED         : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY    : False
>>>
>>> _.flags
  C_CONTIGUOUS    : False
  F_CONTIGUOUS    : True <------------------------TRANSPOSE'd
  OWNDATA         : False
  WRITEABLE       : True
  ALIGNED         : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY    : False

QED


EPILOGUE :

Hopefully, the point behind the notice above now comes across clear and sound.
