
Optimizing Cython loop compared to Numpy

#cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True, language_level=3
from libc.math cimport INFINITY

cpdef int query(double[::1] q, double[:,::1] data) nogil:
    cdef:
        int n = data.shape[0]
        int dim = data.shape[1]
        int i, j
        int best_i = -1
        double best_ip = -INFINITY  # -1 would give a wrong answer if every inner product were <= -1
        double ip
    # Linear scan: return the index of the row of data with the largest
    # inner product with q.
    for i in range(n):
        ip = 0
        for j in range(dim):
            ip += q[j] * data[i, j]
        if ip > best_ip:
            best_i = i
            best_ip = ip
    return best_i

After compiling, I time the code from Python:

import numpy as np
import ip
n, dim = 10**6, 10**2
X = np.random.randn(n, dim)
q = np.random.randn(dim)
%timeit ip.query(q, X)

This takes roughly 100 ms. Meanwhile, the equivalent NumPy code:

%timeit np.argmax(q @ X.T)

takes just around 50 ms.

This is odd, since the NumPy code seemingly has to allocate the big array q @ X.T before taking the argmax. I thus wonder if there are some optimizations I am missing?

I have added extra_compile_args=["-O3", "-march=native"] to my setup.py, and I also tried changing the function definition to

cpdef int query(np.ndarray[double] q, np.ndarray[double, ndim=2] data):

but it made virtually no difference in performance.
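For reference, a minimal setup.py along these lines might look as follows (a sketch; the file name ip.pyx is an assumption, chosen to match the import ip above):

from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "ip",                    # module name matches "import ip" in the benchmark
    ["ip.pyx"],
    extra_compile_args=["-O3", "-march=native"],
)

setup(ext_modules=cythonize([ext]))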

The operation q @ X.T will be mapped under the hood to an implementation of matrix-vector multiplication (dgemv) from either OpenBLAS or MKL (depending on your NumPy distribution) - that means you are up against one of the best-optimized algorithms out there.
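To make that mapping concrete, the same product can be computed by calling the BLAS routine directly through SciPy's wrappers (a sketch; using SciPy here is an assumption beyond this answer, since NumPy dispatches to its bundled BLAS internally):

import numpy as np
from scipy.linalg.blas import dgemv

n, dim = 10**6, 10**2
X = np.random.randn(n, dim)
q = np.random.randn(dim)

# q @ X.T computes the same vector as X @ q, i.e. a single dgemv call
y = dgemv(1.0, X, q)               # y = 1.0 * (X @ q)
assert np.allclose(y, q @ X.T)
print(np.argmax(y))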

The resulting vector has 1M elements, which amounts to about 8 MB of memory. 8 MB will not always fit into the L3 cache, but even RAM has a bandwidth of about 15 GB/s, so writing and reading back 8 MB takes at most 1-2 ms - not much compared to the overall running time of about 50 ms.
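A quick back-of-the-envelope check of that estimate (the 15 GB/s figure is the assumption from above):

n = 10**6                            # elements in the result vector
nbytes = n * 8                       # float64 -> 8 MB
bandwidth = 15e9                     # assumed RAM bandwidth, ~15 GB/s
print(2 * nbytes / bandwidth * 1e3)  # one write plus one read: ~1.07 ms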

The most obvious issue with your code is that it doesn't calculate the same thing as q @ X.T. It calculates

((q[0]*data[i,0]+q[1]*data[i,1])+q[2]*data[i,2])+...

Because of IEEE 754, the compiler is not allowed to reorder these operations and has to execute them in this non-optimal order: to compute the second addition, it must wait until the first addition has finished. Such a serial chain of dependent additions doesn't use the full potential of modern architectures.
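The root cause is that floating-point addition is not associative, so any reordering could change the result bit-for-bit; a one-line demonstration:

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False: the two orders round differently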

A good dgemv implementation will choose a much better order of operations. A similar issue, but with plain sums, is discussed in this SO post.
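For illustration only (this variant is not part of the original answer): the dependency chain can also be broken by hand with several independent accumulators. A minimal sketch, assuming dim is a multiple of 4:

from libc.math cimport INFINITY

cpdef int query_unrolled(double[::1] q, double[:,::1] data) nogil:
    cdef:
        int n = data.shape[0]
        int dim = data.shape[1]
        int i, j
        int best_i = -1
        double best_ip = -INFINITY
        double s0, s1, s2, s3, ip
    for i in range(n):
        s0 = s1 = s2 = s3 = 0
        for j in range(0, dim, 4):
            # four independent chains can overlap in the CPU pipeline
            s0 += q[j]     * data[i, j]
            s1 += q[j + 1] * data[i, j + 1]
            s2 += q[j + 2] * data[i, j + 2]
            s3 += q[j + 3] * data[i, j + 3]
        ip = (s0 + s1) + (s2 + s3)
        if ip > best_ip:
            best_i = i
            best_ip = ip
    return best_i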

To level the playing field, one could use -ffast-math, which allows the compiler to reorder the operations and thus make better use of the pipeline.
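The query_ffastmath timing below presumably comes from such a build; one way to set it up is a second extension module compiled with the extra flag (a sketch; the module and file names ip_ffastmath are hypothetical, and note that compiler flags apply per extension module, not per function):

from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "ip_ffastmath",            # hypothetical copy of the kernel from above
    ["ip_ffastmath.pyx"],
    extra_compile_args=["-O3", "-march=native", "-ffast-math"],
)

setup(ext_modules=cythonize([ext]))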

Here are results on my machine for your benchmark:

%timeit query(q, X)            # 101 ms
%timeit query_ffastmath(q, X)  # 56.3 ms
%timeit np.argmax(q @ X.T)     # 50.2 ms

It is still about 10% slower, but I would be really surprised if a compiler could beat a hand-crafted version created by experts especially for my processor.
