
Optimizing Cython loop compared to Numpy

#cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True, language_level=3
from libc.math cimport INFINITY

cpdef int query(double[::1] q, double[:,::1] data) nogil:
    cdef:
        int n = data.shape[0]
        int dim = data.shape[1]
        int i, j
        int best_i = -1
        double best_ip = -INFINITY  # -1 would give a wrong answer if every inner product were <= -1
        double ip
    # Linear scan: return the index of the row of data with the largest
    # inner product with q.
    for i in range(n):
        ip = 0
        for j in range(dim):
            ip += q[j] * data[i, j]
        if ip > best_ip:
            best_i = i
            best_ip = ip
    return best_i

After compiling, I time the code from Python:

import numpy as np
import ip
n, dim = 10**6, 10**2
X = np.random.randn(n, dim)
q = np.random.randn(dim)
%timeit ip.query(q, X)

This takes roughly 100 ms. Meanwhile, the equivalent NumPy code:

%timeit np.argmax(q @ X.T)

takes just around 50 ms.

This is odd, since the NumPy code seemingly has to allocate the big array q @ X.T before taking the argmax. I thus wonder if there are some optimizations I am missing?

I have added extra_compile_args=["-O3", "-march=native"] to my setup.py, and I also tried changing the function definition to

cpdef int query(np.ndarray[double] q, np.ndarray[double, ndim=2] data):

but it made virtually no difference in performance.
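For reference, a minimal setup.py along these lines might look as follows (a sketch; the file name ip.pyx is an assumption, chosen to match the import ip above):

from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "ip",                    # module name matches "import ip" in the benchmark
    ["ip.pyx"],
    extra_compile_args=["-O3", "-march=native"],
)

setup(ext_modules=cythonize([ext]))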

The operation q @ X.T will be mapped under the hood to an implementation of matrix-vector multiplication (dgemv) from either OpenBLAS or MKL (depending on your NumPy distribution) - that means you are up against one of the best-optimized algorithms out there.
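To make that mapping concrete, the same product can be computed by calling the BLAS routine directly through SciPy's wrappers (a sketch; using SciPy here is an assumption beyond this answer, since NumPy dispatches to its bundled BLAS internally):

import numpy as np
from scipy.linalg.blas import dgemv

n, dim = 10**6, 10**2
X = np.random.randn(n, dim)
q = np.random.randn(dim)

# q @ X.T computes the same vector as X @ q, i.e. a single dgemv call
y = dgemv(1.0, X, q)               # y = 1.0 * (X @ q)
assert np.allclose(y, q @ X.T)
print(np.argmax(y))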

The resulting vector has 1M elements, which amounts to about 8 MB of memory. 8 MB will not always fit into the L3 cache, but even RAM has a bandwidth of about 15 GB/s, so writing and reading back 8 MB takes at most 1-2 ms - not much compared to the overall running time of about 50 ms.
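A quick back-of-the-envelope check of that estimate (the 15 GB/s figure is the assumption from above):

n = 10**6                            # elements in the result vector
nbytes = n * 8                       # float64 -> 8 MB
bandwidth = 15e9                     # assumed RAM bandwidth, ~15 GB/s
print(2 * nbytes / bandwidth * 1e3)  # one write plus one read: ~1.07 ms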

The most obvious issue with your code is that it doesn't calculate the same thing as q @ X.T. It calculates

((q[0]*data[i,0]+q[1]*data[i,1])+q[2]*data[i,2])+...

Because of IEEE 754, the compiler is not allowed to reorder these operations and has to execute them in this non-optimal order: to compute the second addition, it must wait until the first addition has finished. Such a serial chain of dependent additions doesn't use the full potential of modern architectures.
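The root cause is that floating-point addition is not associative, so any reordering could change the result bit-for-bit; a one-line demonstration:

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False: the two orders round differently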

A good dgemv implementation will choose a much better order of operations. A similar issue, but with plain sums, is discussed in this SO post.
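For illustration only (this variant is not part of the original answer): the dependency chain can also be broken by hand with several independent accumulators. A minimal sketch, assuming dim is a multiple of 4:

from libc.math cimport INFINITY

cpdef int query_unrolled(double[::1] q, double[:,::1] data) nogil:
    cdef:
        int n = data.shape[0]
        int dim = data.shape[1]
        int i, j
        int best_i = -1
        double best_ip = -INFINITY
        double s0, s1, s2, s3, ip
    for i in range(n):
        s0 = s1 = s2 = s3 = 0
        for j in range(0, dim, 4):
            # four independent chains can overlap in the CPU pipeline
            s0 += q[j]     * data[i, j]
            s1 += q[j + 1] * data[i, j + 1]
            s2 += q[j + 2] * data[i, j + 2]
            s3 += q[j + 3] * data[i, j + 3]
        ip = (s0 + s1) + (s2 + s3)
        if ip > best_ip:
            best_i = i
            best_ip = ip
    return best_i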

To level the playing field, one could use -ffast-math, which allows the compiler to reorder the operations and thus make better use of the pipeline.
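The query_ffastmath timing below presumably comes from such a build; one way to set it up is a second extension module compiled with the extra flag (a sketch; the module and file names ip_ffastmath are hypothetical, and note that compiler flags apply per extension module, not per function):

from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "ip_ffastmath",            # hypothetical copy of the kernel from above
    ["ip_ffastmath.pyx"],
    extra_compile_args=["-O3", "-march=native", "-ffast-math"],
)

setup(ext_modules=cythonize([ext]))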

Here are results on my machine for your benchmark:

%timeit query(q, X)            # 101 ms
%timeit query_ffastmath(q, X)  # 56.3 ms
%timeit np.argmax(q @ X.T)     # 50.2 ms

It is still about 10% slower, but I would be really surprised if a compiler could beat a hand-crafted version created by experts especially for my processor.
