
Poor Cython performance with typed MemoryView

I am trying to speed up some pure-Python code using Cython. Here is the original Python code:

import numpy as np
def image_to_mblocks(image_component):
    img_shape = np.shape(image_component)
    v_mblocks = img_shape[0] // 16
    h_mblocks = img_shape[1] // 16
    x = image_component
    x = [x[i * 16:(i + 1) * 16:, j * 16:(j + 1) * 16:] for i in range(v_mblocks) for j in range(h_mblocks)]
    return x

The argument image_component is a 2-dimensional numpy.ndarray, where the length of each dimension is evenly divisible by 16. In pure Python, this function is fast--on my machine, 100 calls with image_component of shape (640, 480) take 80 ms. However, I need to call this function on the order of thousands to tens of thousands of times, so I am interested in speeding it up.
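For reference, the timing can be reproduced with timeit (a sketch; the exact milliseconds depend on the machine):

```python
import timeit

import numpy as np

def image_to_mblocks(image_component):
    # same pure-Python blocking function as above
    img_shape = np.shape(image_component)
    v_mblocks = img_shape[0] // 16
    h_mblocks = img_shape[1] // 16
    x = image_component
    return [x[i * 16:(i + 1) * 16, j * 16:(j + 1) * 16]
            for i in range(v_mblocks) for j in range(h_mblocks)]

image = np.zeros((640, 480), dtype=np.uint8)
blocks = image_to_mblocks(image)
print(len(blocks))                      # 1200 macroblocks of shape (16, 16)

elapsed = timeit.timeit(lambda: image_to_mblocks(image), number=100)
print(f"100 calls: {elapsed * 1000:.1f} ms")
```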

Here is my Cython implementation:

import numpy as np
cimport numpy as np
cimport cython
ctypedef unsigned char DTYPE_pixel

cpdef np.ndarray[DTYPE_pixel, ndim=3] image_to_mblocks(unsigned char[:, :] image_component):

    cdef int i
    cdef int j
    cdef int k = 0
    cdef int v_mblocks = image_component.shape[0] // 16
    cdef int h_mblocks = image_component.shape[1] // 16
    cdef np.ndarray[DTYPE_pixel, ndim=3] x = np.empty((v_mblocks*h_mblocks, 16, 16), dtype=np.uint8)

    for i in range(v_mblocks):
        for j in range(h_mblocks):
            x[k] = image_component[i * 16:(i + 1) * 16:, j * 16:(j + 1) * 16:]
            k += 1
    return x

The Cython implementation uses a typed MemoryView in order to support slicing of image_component. This Cython implementation takes 250 ms on my machine for 100 iterations (same conditions as before: image_component is a (640, 480) array).

Here is my question: in the example I've given, why does Cython fail to outperform the pure Python implementation?

I believe I've followed all the steps in the Cython documentation for working with numpy arrays, but I've failed to achieve the performance boost I was expecting.

For reference, here is what my setup.py file looks like:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

extensions = [
    Extension('proto_mpeg_computation', ['proto_mpeg_computation.pyx'],
          include_dirs=[numpy.get_include()]
          ),
]

setup(
   name = "proto_mpeg_x",
   ext_modules = cythonize(extensions)
)

The reason you have significantly worse performance is that the Cython version is copying data, while the original version is creating references (views) to existing data.

The line

x[i * 16:(i + 1) * 16:, j * 16:(j + 1) * 16:]

creates a view on the original x array (i.e. if you change x then the view will change too). You can confirm this by checking that the numpy owndata flag is False on the elements of the array returned from your Python function. This operation is very cheap because all it does is store a pointer and some shape/stride information.

In the Cython version you do

x[k] = image_component[i * 16:(i + 1) * 16:, j * 16:(j + 1) * 16:]

This needs to copy a 16-by-16 array into the memory already allocated for x. It isn't ultra-slow, but it's more work than in your original Python version. Again, confirm by checking owndata on the function's return value. You should find that it is True.
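The copy side can be verified the same way (a sketch mirroring the Cython assignment in plain numpy):

```python
import numpy as np

image = np.arange(64, dtype=np.uint8).reshape(8, 8)
x = np.empty((4, 4, 4), dtype=np.uint8)
x[0] = image[0:4, 0:4]         # slice assignment copies into x's buffer

print(x.flags.owndata)         # True: x owns its own memory
image[0, 0] = 99               # mutating the source afterwards...
print(x[0, 0, 0])              # ...does not affect the copy: still 0
```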

In your case you should consider whether you want views of the data or copies of the data.


This isn't the sort of problem where Cython is going to help much, in my view. Cython gives a good speed-up for indexing individual elements, but when you start indexing slices it behaves the same way as base Python/numpy (which is actually pretty efficient for this type of use).

I suspect you'd get a small gain from putting your original Python code into Cython and typing image_component as either unsigned char[:, :] or np.ndarray[DTYPE_pixel, ndim=2]. You can also cut out a tiny bit of reference counting by not assigning to x and just returning the list comprehension directly. Beyond that I don't see how you can gain much.
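As a side note (not part of the answer above, and the helper name image_to_mblocks_view is hypothetical): if a 4D array of views is acceptable instead of a list, reshape plus swapaxes produces all the macroblocks without copying any pixel data and without Python-level loops:

```python
import numpy as np

def image_to_mblocks_view(image):
    # hypothetical helper: the [i, j] entry of the result is the
    # 16x16 block at block-row i, block-column j; no pixels are copied
    v, h = image.shape[0] // 16, image.shape[1] // 16
    return image.reshape(v, 16, h, 16).swapaxes(1, 2)

image = np.arange(640 * 480).astype(np.uint8).reshape(640, 480)
blocks = image_to_mblocks_view(image)
print(blocks.shape)            # (40, 30, 16, 16)
print(blocks.flags.owndata)    # False: still a view of image's data
```

Note that flattening the result further with another reshape (e.g. to shape (-1, 16, 16)) would force a copy, because after swapaxes the data is no longer contiguous.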
