Python fast cosine distance with cython

I want to speed up the cosine distance calculation scipy.spatial.distance.cosine as much as possible, so I tried to use numpy:

def alt_cosine(x,y):
    return 1 - np.inner(x,y)/np.sqrt(np.dot(x,x)*np.dot(y,y))

I tried cython:

from libc.math cimport sqrt
def alt_cosine_2(x,y):
    return 1 - np.inner(x,y)/sqrt(np.dot(x,x)*np.dot(y,y))

and got gradual improvements (tested on numpy arrays of length 50):

>>> cosine() # ... make some timings
5.27526156300155e-05 # mean calculation time for one loop

>>> alt_cosine() 
9.913400815003115e-06

>>> alt_cosine_2()
7.0269494536660205e-06
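
For context, a minimal harness along the following lines could produce such per-loop means; the random test data and loop count here are only an example, not the exact setup used:

import timeit
import numpy as np
from scipy.spatial.distance import cosine

def alt_cosine(x,y):
    return 1 - np.inner(x,y)/np.sqrt(np.dot(x,x)*np.dot(y,y))

x = np.random.rand(50).astype(np.float32)   # example test vectors of length 50
y = np.random.rand(50).astype(np.float32)

for name, fn in [("cosine", lambda: cosine(x, y)),
                 ("alt_cosine", lambda: alt_cosine(x, y))]:
    loops = 100000
    print(name, timeit.timeit(fn, number=loops) / loops)   # mean seconds per call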

What is the fastest way to do this? Unfortunately, I was not able to specify variable types for alt_cosine_2. I will use this function with numpy arrays of type np.float32.

There is a belief that numpy's functionality cannot be sped up with the help of cython or numba. But this is not entirely true: numpy's goal is to offer great performance for a wide range of scenarios, which also means somewhat less than perfect performance for special scenarios.

With a particular scenario at hand, you have a chance to improve on numpy's performance, even if it means rewriting some of numpy's functionality. For example, in this case we can accelerate the function by a factor of 4 using cython and a factor of 8 using numba.

Let's start with your versions as the baseline (see the listings at the end of the answer):

>>>%timeit cosine(x,y)   # scipy's
31.9 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>>%timeit np_cosine(x,y)  # your numpy-version
4.05 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>>%timeit np_cosine_fhtmitchell(x,y)  # @FHTmitchell's version
4 µs ± 53.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>>%timeit np_cy_cosine(x,y)
2.56 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I cannot see the improvement from @FHTmitchell's version, but otherwise the results do not differ much from your timings.

Your vectors have only 50 elements, so the real calculation needs only around 200-300 ns: everything else is function-call overhead. One possibility to reduce the overhead is to "inline" these functions by hand with the help of cython:

%%cython 
from libc.math cimport sqrt
import numpy as np
cimport numpy as np

def cy_cosine(np.ndarray[np.float64_t] x, np.ndarray[np.float64_t] y):
    cdef double xx=0.0
    cdef double yy=0.0
    cdef double xy=0.0
    cdef Py_ssize_t i
    for i in range(len(x)):
        xx+=x[i]*x[i]
        yy+=y[i]*y[i]
        xy+=x[i]*y[i]
    return 1.0-xy/sqrt(xx*yy)

which leads to:

>>> %timeit cy_cosine(x,y)
921 ns ± 19.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Not bad! We can try to squeeze out even more performance by giving up some safety (runtime checks + strict IEEE 754 compliance) with the following changes:

%%cython  -c=-ffast-math
...

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_cosine_perf(np.ndarray[np.float64_t] x, np.ndarray[np.float64_t] y):
    ...

which leads to:

>>> %timeit cy_cosine_perf(x,y)
828 ns ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

i.e. another 10%, which means almost a factor of 5 faster than the numpy version.
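
Since the question mentions np.float32 arrays: the same approach works with the corresponding type declaration. Here is a sketch (an untested variant of cy_cosine above, keeping the accumulators as double):

%%cython
from libc.math cimport sqrt
import numpy as np
cimport numpy as np

def cy_cosine_f32(np.ndarray[np.float32_t] x, np.ndarray[np.float32_t] y):
    # same loop as cy_cosine, only the input dtype changes to float32
    cdef double xx=0.0
    cdef double yy=0.0
    cdef double xy=0.0
    cdef Py_ssize_t i
    for i in range(len(x)):
        xx+=x[i]*x[i]
        yy+=y[i]*y[i]
        xy+=x[i]*y[i]
    return 1.0-xy/sqrt(xx*yy)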

There is another tool that offers similar functionality/performance - numba:

import numba as nb
import numpy as np
@nb.jit(nopython=True, fastmath=True)
def nb_cosine(x, y):
    xx,yy,xy=0.0,0.0,0.0
    for i in range(len(x)):
        xx+=x[i]*x[i]
        yy+=y[i]*y[i]
        xy+=x[i]*y[i]
    return 1.0-xy/np.sqrt(xx*yy)

which leads to:

>>> %timeit nb_cosine(x,y)
495 ns ± 5.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

a speed-up of factor 8 compared to the original numpy version.

There are some reasons why numba can be faster: Cython handles the stride of the data at run time, which prevents some optimizations (e.g. vectorization). Numba seems to handle this better.
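
If one wants to give Cython the same compile-time knowledge, one possible route (a sketch, not part of the original measurements) is to declare the arguments as contiguous typed memoryviews, so the stride is fixed at compile time:

%%cython -c=-ffast-math
from libc.math cimport sqrt

def cy_cosine_contig(double[::1] x, double[::1] y):
    # double[::1] requires C-contiguous input, so the compiler knows the
    # stride is 1 and can vectorize the loop more aggressively
    cdef double xx=0.0, yy=0.0, xy=0.0
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        xx+=x[i]*x[i]
        yy+=y[i]*y[i]
        xy+=x[i]*y[i]
    return 1.0-xy/sqrt(xx*yy)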

But here the difference is entirely due to the lower call overhead of numba:

%%cython  -c=-ffast-math
import numpy as np
cimport numpy as np

def cy_empty(np.ndarray[np.float64_t] x, np.ndarray[np.float64_t] y):
    return x[0]*y[0]

import numba as nb
import numpy as np
@nb.jit(nopython=True, fastmath=True)
def nb_empty(x, y):
    return x[0]*y[0]

%timeit cy_empty(x,y)
753 ns ± 6.81 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit nb_empty(x,y)
456 ns ± 2.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

almost a factor of 2 less overhead for numba!

As @max9111 has pointed out, numba not only inlines other jitted functions, but it is also able to call some numpy functions with very little overhead, so the following version (replacing inner with dot):

@nb.jit(nopython=True, fastmath=True)
def np_nb_cosine(x,y):
    return 1 - np.dot(x,y)/np.sqrt(np.dot(x,x)*np.dot(y,y))

>>> %timeit np_nb_cosine(x,y)
605 ns ± 5.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 

only about 10% slower.


Please be aware that the above comparison is only valid for vectors with 50 elements. For more elements, the situation is completely different: the numpy version uses the parallelized mkl (or similar) implementation of the dot product and will easily beat our simple attempts.
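
A quick way to check this crossover is a sketch along the following lines; the length of 10**6 elements is an arbitrary assumption, and np_cosine/nb_cosine are the functions defined above and in the listings:

import timeit
import numpy as np

n = 10**6                          # assumed "large" size, not from the answer
x = np.random.rand(n)
y = np.random.rand(n)

# np_cosine and nb_cosine as defined in the listings
print("numpy:", timeit.timeit(lambda: np_cosine(x, y), number=100) / 100)
print("numba:", timeit.timeit(lambda: nb_cosine(x, y), number=100) / 100)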

This raises the question: is it really worth optimizing the code for a special input size? Sometimes the answer is "yes" and sometimes it is "no".

If possible, I would go with the numba + dot solution, which is very fast for small inputs but also has the full power of the mkl implementation for larger inputs.


There is also a slight difference: the first versions return an np.float64 object, while the cython and numba versions return a Python float.
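
One quick way to see this difference (assuming the functions and test arrays above are defined):

>>> type(np_cosine(x,y)), type(cy_cosine(x,y)), type(nb_cosine(x,y))
(<class 'numpy.float64'>, <class 'float'>, <class 'float'>)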


Listings:

from scipy.spatial.distance import cosine
import numpy as np
x=np.arange(50, dtype=np.float64)
y=np.arange(50,100, dtype=np.float64)

def np_cosine(x,y):
    return 1 - np.inner(x,y)/np.sqrt(np.dot(x,x)*np.dot(y,y))

from numpy import inner, sqrt, dot
def np_cosine_fhtmitchell(x,y):
    return 1 - inner(x,y)/sqrt(dot(x,x)*dot(y,y))

%%cython
from libc.math cimport sqrt
import numpy as np
def np_cy_cosine(x,y):
    return 1 - np.inner(x,y)/sqrt(np.dot(x,x)*np.dot(y,y))

The lazy ways of speeding up this kind of code are:

  1. using the numexpr Python module
  2. using the numba Python module
  3. using SciPy equivalents of NumPy functions

Unfortunately, none of these tricks would work for you because:

  1. dot and inner are not implemented in numexpr
  2. numba (like Cython) would not speed up calls to NumPy's functions
  3. dot and inner are not implemented differently in scipy (they are not even available in that namespace).

Perhaps your best bet is to try compiling numpy against different underlying linear algebra libraries (e.g. LAPACK, BLAS, OpenBLAS, etc.) and with different compilation options (e.g. multithreading and the like) to see which combination proves most effective for your use case.
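
Before recompiling anything, it is worth checking which backend the installed numpy build is already linked against (output varies by installation):

>>> import numpy as np
>>> np.show_config()   # prints the BLAS/LAPACK configuration numpy was built with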

Good luck!
