Cython prange with an array of strings

Question

I'm trying to use prange to process multiple strings. Since this is not possible with a Python list, I'm using a numpy array.

With an array of floats, this function works:

from cython.parallel import prange
cimport numpy as np
from numpy cimport ndarray as ar

cpdef func_float(ar[np.float64_t,cast=True] x, double alpha):
    cdef int i
    for i in prange(x.shape[0], nogil=True):
        x[i] = alpha * x[i]
    return x

When I try this simple one:

cpdef func_string(ar[np.str,cast=True] x):
    cdef int i
    for i in prange(x.shape[0], nogil=True):
        x[i] = x[i] + str(i)
    return x

I get this:

>> func_string(x = np.array(["apple","pear"],dtype=np.str))
  File "processing.pyx", line 8, in processing.func_string
    cpdef func_string(ar[np.str,cast=True] x):
ValueError: Item size of buffer (20 bytes) does not match size of 'str object' (8 bytes)

I'm probably missing something, and I can't find an alternative to str. Is there a way to properly use prange with an array of strings?

Answer 1

Apart from the fact that your code should fail when cythonized, because you try to create a Python object (i.e. str(i)) without the GIL, your code isn't doing what you think it should do.

To analyse what is going on, let's take a look at a much simpler Cython version:

%%cython -2
cimport numpy as np
from numpy cimport ndarray as ar

cpdef func_string(ar[np.str, cast=True] x):
    print(len(x))

From your error message, one can deduce that you use Python 3 and that the Cython extension is built with the (still default) language_level=2, which is why I'm using -2 in the %%cython magic cell.

And now:

>>> x = np.array(["apple", "pear"], dtype=np.str)
>>> func_string(x)    
ValueError: Item size of buffer (20 bytes) does not match size of 'str object' (8 bytes)

What is going on?

x is not what you think it is

First, let's take a look at x:

>>> x.dtype
<U5

So x isn't a collection of unicode objects. An element of x consists of 5 unicode characters, and those elements are stored contiguously in memory, one after another. The important point: it is the same information as in unicode objects, but stored in a different memory layout.

This is one of numpy's quirks and how np.array works: every element in the list is converted to a unicode object, then the maximal element size is calculated, and from it the dtype (in this case <U5) is derived and used.
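To make the layout concrete, here is a minimal sketch in plain Python that inspects such an array. It uses the builtin str as the dtype (an assumption on my part, since the np.str alias is not available in every numpy version), and the variable names are only illustrative.

import numpy as np

# The longest element ("apple", 5 characters) determines the fixed width.
x = np.array(["apple", "pear"], dtype=str)
print(x.dtype)            # fixed-width unicode dtype, 5 characters per element
print(x.itemsize)         # bytes reserved for every element
print(len(x.tobytes()))   # one contiguous buffer holding all elements

# Adding a longer string widens every element of the new array.
y = np.array(["apple", "pear", "banana"], dtype=str)
print(y.dtype)

Shorter strings such as "pear" are padded up to the common width, which is why every element occupies the same number of bytes.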

np.str is interpreted differently in Cython code (ar[np.str] x), and in two ways!

First difference: in your Python 3 code np.str stands for unicode, but in your Cython code, which is cythonized with language_level=2, np.str stands for bytes (see doc).

Second difference: on seeing np.str, Cython interprets it as an array of Python objects (maybe this should be seen as a Cython bug). It is almost the same as if the dtype were np.object; the only actual difference to np.object is a slightly different error message.

With this information we can understand the error message. At runtime, the input array is checked (before the first line of the function is executed!):

expected is an array of Python objects, i.e. 8-byte pointers, i.e. an array with an element size of 8 bytes
received is an array with an element size of 5*4=20 bytes (one unicode character is 4 bytes)

Thus the cast cannot be done and the observed exception is thrown.
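The two sizes in the error message can be reproduced from plain Python. This is only a sketch of the comparison, not of Cython's actual buffer check; pointer_size and obj are names introduced here for illustration.

import struct
import numpy as np

x = np.array(["apple", "pear"], dtype=str)

# What the function received: 5 characters * 4 bytes per character.
print(x.itemsize)

# What the signature expects: an array of Python objects stores one
# pointer per element, so its item size is the platform pointer size.
obj = x.astype(object)
pointer_size = struct.calcsize("P")
print(obj.itemsize, pointer_size)

On a typical 64-bit build the pointer size is 8 bytes, which matches the "8 bytes" in the exception.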

You cannot change the size of an element in a <U.. numpy array:

Now let's take a look at the following:

>>> x = np.array(["apple", b"pear"], dtype=np.str)
>>> x[0] = x[0]+str(0)
>>> x[0]
'apple'

The element didn't change, because the string x[0]+str(0) was truncated when it was written back to the x array: there is room for only 5 characters! It would work with "pear" though (to some degree, as long as the resulting string has no more than 5 characters):

>>> x[1] = x[1]+str(1)
>>> x[1]
'pear1'
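One way around the truncation, sketched here under the assumption that the maximal length of the result is known in advance, is to copy the data into a wider dtype before modifying it. The name wide and the choice of U6 are illustrative only.

import numpy as np

x = np.array(["apple", "pear"], dtype=str)   # 5 characters per element

# Writing a 6-character string into a 5-character slot drops the tail.
x[0] = x[0] + "0"
print(x[0])

# Copying into a wider dtype first leaves room for the extra character.
wide = x.astype("U6")
wide[0] = wide[0] + "0"
wide[1] = wide[1] + "1"
print(wide)

Note that astype creates a new array, so this is not an in-place change to x.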

Where does this all leave you?

You probably want to use bytes and not unicode (i.e. dtype=np.bytes_).
Given that you don't know the element size of your numpy array at compile time, you should declare the input array x as ar x in the signature and write the runtime checks yourself, similar to what is done in Cython's "deprecated" numpy tutorial.
If the changes should be done in place, the elements of the input array must be big enough for the resulting strings.

None of the above has anything to do with prange. To use prange you cannot use str(i), because it operates on Python objects.
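As a rough illustration of these points, the following plain-Python sketch treats a fixed-width bytes array as a 2D matrix of raw bytes and appends a digit in place without creating a string per element. The S8 width, the name buf and the single-digit suffix are my assumptions; a Cython version could presumably run the same loop body over an unsigned char[:, ::1] memoryview inside prange with nogil, but that is not shown or tested here.

import numpy as np

# Bytes instead of unicode, with spare room (8 bytes) for the result.
x = np.array([b"apple", b"pear"], dtype="S8")

# View the same memory as one row of raw bytes per element.
buf = x.view(np.uint8).reshape(len(x), x.itemsize)

for i in range(len(x)):
    row = buf[i]
    # Current length, assuming the string contains no embedded NUL bytes.
    n = int(np.count_nonzero(row))
    if n < row.shape[0]:
        # Write the ASCII code of the last decimal digit of i in place.
        row[n] = ord("0") + i % 10

print(x)

Because buf is a view and not a copy, the writes show up in x itself, and the loop body touches only integers and raw bytes.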
