
Cython prange with an array of strings

I'm trying to use prange to process multiple strings. Since this is not possible with a Python list, I'm using a numpy array.

With an array of floats, this function works:

from cython.parallel import prange
cimport numpy as np
from numpy cimport ndarray as ar

cpdef func_float(ar[np.float64_t,cast=True] x, double alpha):
    cdef int i
    for i in prange(x.shape[0], nogil=True):
        x[i] = alpha * x[i]
    return x

When I try this simple one:

cpdef func_string(ar[np.str,cast=True] x):
    cdef int i
    for i in prange(x.shape[0], nogil=True):
        x[i] = x[i] + str(i)
    return x

I get this:

>>> func_string(x = np.array(["apple","pear"],dtype=np.str))
  File "processing.pyx", line 8, in processing.func_string
    cpdef func_string(ar[np.str,cast=True] x):
ValueError: Item size of buffer (20 bytes) does not match size of 'str object' (8 bytes)

I'm probably missing something, and I can't find an alternative to str. Is there a way to properly use prange with an array of strings?

Besides the fact that your code should fail when cythonized, because you try to create a Python object (i.e. str(i)) without the GIL, your code isn't doing what you think it does.

To analyse what is going on, let's take a look at a much simpler Cython version:

%%cython -2
cimport numpy as np
from numpy cimport ndarray as ar

cpdef func_string(ar[np.str, cast=True] x):
    print(len(x))

From your error message, one can deduce that you use Python 3 and that the Cython extension is built with the (still default) language_level=2, thus I'm using -2 in the %%cython magic cell.

And now:

>>> x = np.array(["apple", "pear"], dtype=np.str)
>>> func_string(x)    
ValueError: Item size of buffer (20 bytes) does not match size of 'str object' (8 bytes)

What is going on?

x is not what you think it is

First, let's take a look at x:

>>> x.dtype
<U5

So x isn't a collection of unicode objects. An element of x consists of 5 unicode characters, and those elements are stored contiguously in memory, one after another. What is important: it is the same information as in unicode objects, but stored in a different memory layout.

This is one of numpy's quirks and how np.array works: every element in the list is converted to a unicode object, then the maximal size of the elements is calculated, and the dtype (in this case <U5) is derived from it and used.
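The dtype inference described above can be checked directly in plain Python. This is a minimal sketch, assuming a NumPy build that uses 4 bytes per unicode character; dtype=str is used in place of np.str, an alias that newer NumPy releases removed, and the array y with "banana" is only an illustration:

```python
import numpy as np

# dtype=str stands in for np.str, which newer NumPy releases no longer provide.
x = np.array(["apple", "pear"], dtype=str)
print(x.dtype.kind, x.itemsize)      # kind 'U'; 5 characters * 4 bytes per slot

# A longer element widens every slot of the array, not just its own.
y = np.array(["apple", "pear", "banana"], dtype=str)
print(y.dtype.kind, y.itemsize)      # 6 characters * 4 bytes per slot
```

Shorter elements such as "pear" still occupy a full slot; the unused tail is padded.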

np.str is interpreted differently in Cython code (ar[np.str] x) (twice!)

First difference: in your Python 3 code np.str is for unicode, but in your Cython code, which is cythonized with language_level=2, np.str is for bytes (see doc).

Second difference: seeing np.str, Cython will interpret it as an array of Python objects (maybe this should be seen as a Cython bug). It is almost the same as if the dtype were np.object; actually, the only difference from np.object is slightly different error messages.

With this information we can understand the error message. At runtime, the input array is checked (before the first line of the function is executed!):

  1. expected is an array of Python objects, i.e. 8-byte pointers, i.e. an array with an element size of 8 bytes
  2. received is an array with an element size of 5*4=20 bytes (one unicode character is 4 bytes)

thus the cast cannot be done and the observed exception is thrown.
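The two item sizes from the error message can be reproduced without compiling anything. A small sketch, assuming a 64-bit platform for the "8 bytes" figure; the variable names are illustrative only:

```python
import struct
import numpy as np

x = np.array(["apple", "pear"], dtype=str)

pointer_size = struct.calcsize("P")        # size of a C pointer on this platform
object_item = np.dtype(object).itemsize    # an object array stores one pointer per element

print("received item size:", x.itemsize)   # what the caller passed in
print("expected item size:", object_item)  # what a buffer of Python objects needs
print("layouts compatible:", x.itemsize == object_item)
```

Because the buffer check compares exactly these two numbers, no cast between the layouts is attempted once they differ.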

you cannot change the size of an element in a <U.. numpy array:

Now let's take a look at the following:

>>> x = np.array(["apple", "pear"], dtype=np.str)
>>> x[0] = x[0]+str(0)
>>> x[0]
'apple'

The element didn't change, because the string x[0]+str(0) was truncated when written back to the x array: there is only room for 5 characters! It would work with "pear" though (to some degree, as long as the resulting string has no more than 5 characters):

>>> x[1] = x[1]+str(1)
>>> x[1]
'pear1'
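The truncation, and one possible way around it, can be sketched as follows. The astype("U8") copy is an assumption added for illustration (the width 8 is arbitrary), not something the original answer prescribes:

```python
import numpy as np

x = np.array(["apple", "pear"], dtype=str)   # every slot holds at most 5 characters

x[0] = x[0] + "0"     # 'apple0' needs 6 characters, so the last one is silently dropped
x[1] = x[1] + "1"     # 'pear1' has exactly 5 characters and fits
print(x)

# A copy with wider slots has room for the longer result.
wide = x.astype("U8")
wide[0] = wide[0] + "0"
print(wide)
```

Note that NumPy raises no error or warning on the truncating assignment, so the slot width has to be chosen up front.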

Where does all this leave you?

  • you probably want to use bytes and not unicode (i.e. dtype=np.bytes_)
  • given that you don't know the element size of your numpy array at compile time, you should declare the input array x as ar x in the signature and roll out the runtime checks yourself, similar to what is done in Cython's "deprecated" numpy tutorial.
  • if changes should be done in-place, the elements in the input array should be big enough for the resulting strings.

All of the above has nothing to do with prange. To use prange you cannot use str(i), because it operates on Python objects.
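Putting these points together, one possible approach is to treat a fixed-width bytes array as a 2-D block of raw bytes and edit it with integer arithmetic only. The following is a plain-Python stand-in for such a loop body, not a compiled prange version; the helper name append_digit_inplace, the width S8, and the "append one digit" task are all assumptions for illustration. In Cython, the same buffer could presumably be typed as a 2-D unsigned char memoryview so that the outer loop touches no Python objects:

```python
import numpy as np

def append_digit_inplace(buf):
    """Append the digit (row index % 10) to each NUL-padded row of a 2-D uint8 view."""
    n_rows, width = buf.shape
    for i in range(n_rows):            # candidate for prange once compiled with Cython
        # Current string length: index of the first NUL byte in the row.
        length = 0
        while length < width and buf[i, length] != 0:
            length += 1
        if length < width:             # write only if the slot still has room
            buf[i, length] = ord("0") + i % 10

# Bytes instead of unicode, with slots wide enough for the result (8 bytes each).
x = np.array([b"apple", b"pear"], dtype="S8")

# Reinterpret the same memory as one row of raw bytes per string (no copy).
buf = x.view(np.uint8).reshape(x.shape[0], x.itemsize)
append_digit_inplace(buf)
print(x)
```

The digit is produced with ord("0") + i % 10 rather than str(i), which mirrors the restriction that no Python objects may be created inside a nogil loop; multi-digit indices would need a small integer-to-ASCII routine instead.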


 