
Going faster than struct.pack with cython

I'm trying to do better than struct.pack.

Taking the specific case of packing integers, and following the answer to this question, I have the following in pack_ints.pyx to pack a list of ints:

# cython: language_level=3, boundscheck=False
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def pack_ints(int_col):

    int_buf = bytearray(4*len(int_col))
    cdef int[::1] buf_view = memoryview(int_buf).cast('i')

    idx: int = 0
    for idx in range(len(int_col)):
        buf_view[idx] = int_col[idx]


    return int_buf
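
The buffer trick used in pack_ints can be checked without compiling anything. The following is a minimal pure-Python sketch (the names values, buf, view, and same are illustrative, not from the question): casting a memoryview of a bytearray to format 'i' gives a writable view of native C ints, so assigning through it should produce the same bytes as struct.pack with the 'i' format.

```python
from struct import calcsize, pack

values = [0, 1, 2, 70000, -5]

# One native C int per value; calcsize('i') is 4 on common platforms.
buf = bytearray(calcsize('i') * len(values))
view = memoryview(buf).cast('i')

# Writing through the view stores each value in native byte order.
for idx, value in enumerate(values):
    view[idx] = value

# Compare against struct.pack using the same native 'i' format.
same = bytes(buf) == pack(f'{len(values)}i', *values)
```

This only mirrors the data layout; it says nothing about speed, since the loop here runs in the interpreter.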

With this test code in IPython:

from struct import pack 
import pyximport; pyximport.install(language_level=3) 
import pack_ints 

amount = 10**7 
ints = list(range(amount)) 

res1 = pack(f'{amount}i', *ints) 
res2 = pack_ints.pack_ints(ints) 
assert(res1 == res2) 

%timeit pack(f'{amount}i', *ints)  
%timeit pack_ints.pack_ints(ints)      

I get:

304 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
212 ms ± 6.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I tried to type int_buf as an array('b'), but didn't see an improvement.

Is there any other way to improve on this, or to use Cython differently, to make this operation faster?

This answer tries to estimate how much speed-up a parallelized version can yield. However, because this task is memory-bandwidth bound (Python integer objects take at least 32 bytes and can be scattered in memory, so there will be many cache misses), we should not expect much.

The first problem is how to handle errors (an element is not an integer, or its value is too large). I will follow this strategy/simplification: when an object

  • isn't an integer,
  • is a negative integer,
  • or is an integer >=2^30,

it will be cast to a special number (-1), which signals that something went wrong. Allowing only non-negative integers <2^30 makes my life easier, because I have to reimplement PyLong_AsLongAndOverflow without the error raising, and detecting overflows is otherwise often cumbersome (however, see the version at the end of the answer for a more sophisticated approach).

The memory layout of Python's integer object can be found here:

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};

The member ob_size (accessed via the macro Py_SIZE) tells us how many 30-bit digits are used in the representation of the integer (ob_size is negative for a negative integer).
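
To make the digit representation concrete, here is a small pure-Python model of it. This is a sketch under assumptions: to_digits is a hypothetical helper, not a CPython function, and it assumes 30-bit digits (some builds use 15; sys.int_info.bits_per_digit reports the actual value).

```python
def to_digits(value, bits=30):
    """Model (ob_size, ob_digit) for an int, assuming `bits`-bit digits."""
    mask = (1 << bits) - 1
    magnitude = abs(value)
    digits = []
    # Least significant digit first, as in ob_digit.
    while magnitude:
        digits.append(magnitude & mask)
        magnitude >>= bits
    # The sign is carried by the size; zero has size 0 and no digits.
    size = len(digits) if value >= 0 else -len(digits)
    return size, digits


print(to_digits(0))        # (0, [])
print(to_digits(5))        # (1, [5])
print(to_digits(2**30))    # (2, [0, 1])
print(to_digits(-1))       # (-1, [1])
```

Under this model, every non-negative integer below 2^30 has size 0 or 1, which is what the simple rule above relies on.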

My simple rule thus translates to the following C code (I use C rather than Cython, as it is a simpler/more natural way of using Python's C-API):

#include <Python.h>

// returns -1 if vv is not an integer,
//            negative, or > 2**30-1
int to_int(PyObject *vv){ 
   if (PyLong_Check(vv)) {
       PyLongObject * v = (PyLongObject *)vv;
       Py_ssize_t i = Py_SIZE(v);
       if(i==0){
           return 0;
       }
       if(i==1){//small enough for a digit
           return v->ob_digit[0];
       }
       //negative (i<0) or too big (i>1)
       return -1;
   }
   return -1;
}

Now, given a list, we can convert it to an int buffer in parallel with the following C function, which uses OpenMP (omp):

void convert_list(PyListObject *lst, int *output){
    Py_ssize_t n = Py_SIZE(lst);
    PyObject **data = lst->ob_item;
    #pragma omp parallel for
    for(Py_ssize_t i=0; i<n; ++i){
        output[i] = to_int(data[i]);
    }
}

There is not much to say: the PyListObject API is used to access the elements of the list in parallel. This is possible because there is no reference counting and there are no race conditions in the to_int function.

Now, bundling it all together with Cython:

%%cython -c=-fopenmp --link-args=-fopenmp
import cython

cdef extern from *:
    """
    #include <Python.h>

    int to_int(PyObject *vv){ 
       ... code
    }

    void convert_list(PyListObject *lst, int *output){
        ... code
    }
    """
    void convert_list(list lst, int *output)

@cython.boundscheck(False)
@cython.wraparound(False)
def pack_ints_ead(list int_col):
    cdef char[::1] int_buf = bytearray(4*len(int_col))
    convert_list(int_col, <int*>(&int_buf[0]))
    return int_buf.base

One important detail: convert_list must not be declared nogil (because it isn't)! Omp threads and Python threads (which are affected by the GIL) are completely different things.

One can (but need not) release the GIL for omp operations while using objects with the buffer protocol, because those objects get locked via the buffer protocol and cannot be changed from different Python threads. A list has no such locking mechanism, and thus, if the GIL were released, the list could be changed in another thread and all our pointers could get invalidated.
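
The locking mentioned here can be observed from plain Python. The sketch below (variable names are illustrative) shows that a bytearray refuses to be resized while a memoryview of it is exported, whereas a list can be resized at any time. Note that the lock concerns resizing/reallocating the buffer; the bytes themselves can still be overwritten through the view.

```python
buf = bytearray(8)
view = memoryview(buf)      # exporting a buffer pins the bytearray's memory

try:
    buf.extend(b"x")        # would need to reallocate the pinned memory
    resized_while_exported = True
except BufferError:
    resized_while_exported = False

view.release()              # the export is gone, so resizing works again
buf.extend(b"x")

items = [1, 2, 3]
items.append(4)             # a list has no such export lock
```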

So now to the timings (with a slightly bigger list):

amount = 5*10**7 
ints = list(range(amount)) 


%timeit pack(f'{amount}i', *ints)  
# 1.51 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pack_ints_DavidW(ints) 
# 284 ms ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pack_ints_ead(ints) 
# 177 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

By the way, turning off the parallelization for pack_ints_ead leads to a running time of 209 ms.

So given the modest improvement of ca. 33%, I would opt for DavidW's more robust solution.


Here is an implementation with a slightly different way of signaling wrong values:

  • an object that is not an integer results in -2147483648 (i.e. 0x80000000), the smallest negative value a 32-bit int can store.
  • integers >=2147483647 (i.e. >=0x7fffffff) are mapped to/stored as 2147483647, the biggest positive number a 32-bit int can store.
  • integers <=-2147483647 (i.e. <=0x80000001) are mapped to/stored as -2147483647.
  • all other integers are mapped to their correct value.

The main advantage is that it works correctly for a larger range of integer values. This algorithm yields almost the same running time (maybe 2-3% slower) as the first, simple version:

int to_int(PyObject *vv){ 
   if (PyLong_Check(vv)) {
       PyLongObject * v = (PyLongObject *)vv;
       Py_ssize_t i = Py_SIZE(v);
       int sign = i<0 ? -1 : 1;
       i = abs(i);
       if(i==0){
           return 0;
       }
       if(i==1){//small enough for a digit
           return sign*v->ob_digit[0];
       }
       if(i==2 && (v->ob_digit[1]>>1)==0){
           int add = (v->ob_digit[1]&1) << 30;
           return sign*(v->ob_digit[0]+add);
       }
       return sign * 0x7fffffff;
   }
   return 0x80000000;
}
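
For testing the C function, it can help to have a slow reference model of the four rules listed above. The following is a sketch in pure Python; to_int_model and the two constants are hypothetical names introduced here, and the model describes the intended mapping rather than the digit-level implementation.

```python
from struct import pack

NOT_AN_INT = -2147483648   # 0x80000000 read as a signed 32-bit int
INT_MAX = 2147483647       # 0x7fffffff


def to_int_model(obj):
    """Reference model of the saturating to_int rules."""
    if not isinstance(obj, int):
        return NOT_AN_INT
    if obj >= INT_MAX:
        return INT_MAX       # too large: saturate at the top
    if obj <= -INT_MAX:
        return -INT_MAX      # too small: saturate at the bottom
    return obj


# Every result fits a 32-bit int, so struct.pack accepts it.
sample = ["a", 2**40, -2**40, 7]
packed = pack(f"{len(sample)}i", *[to_int_model(x) for x in sample])
```

Comparing packed against the output of the compiled function for the same list would be one way to check the digit arithmetic on boundary values such as 2**30 and 2**31 - 1.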

When I run my code from the original question I get a ~5 times speed-up.

When I run your code here, I see the results you report, plus an important warning at the compile stage that I think you're ignoring:

warning: pack_ints.pyx:13:17: Index should be typed for more efficient access

I'm not sure why it isn't picking up the type correctly, but to fix it you should change the definition of i back to the code I originally wrote:

cdef int i
# not "i: int"

Hopefully someone else will come along and try something cleverer, because it's obviously a bit ridiculous that this is an answer.
