
Vectorized cosine similarity calculation in Python

I have two large sets of vectors, A and B. Each element of A is a 1-dimensional vector of length 400, with float values between -10 and 10. For each vector in A, I'm trying to calculate the cosine similarities to all vectors in B in order to find the top 5 vectors in B that best match the given A vector. For now I'm looping over all of A, and looping over all of B, calculating the cosine similarities one-by-one with SciPy's spatial.distance.cosine(a, b). Is there a faster way to do this? Perhaps with matrices?

You can first transform each vector into its unit vector (by dividing it by its length). Then the distance formula simplifies to

 d = 1 - e_v * e_w

 with e_v = v / ||v||_2 , e_w = w / ||w||_2

(where * is the dot product), which is faster to calculate.
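
As a quick check (my sketch, not part of the original answer; the random test vectors are just an assumption for illustration), normalising first and then taking a dot product reproduces SciPy's cosine distance for a single pair:

import numpy as np
from numpy.linalg import norm
from scipy.spatial.distance import cosine

# Two example 400-dimensional vectors with values in [-10, 10], as in the question.
rng = np.random.default_rng(0)
v = rng.uniform(-10, 10, 400)
w = rng.uniform(-10, 10, 400)

# Normalise to unit length; the cosine distance is then 1 minus a plain dot product.
e_v = v / norm(v)
e_w = w / norm(w)
d = 1.0 - np.dot(e_v, e_w)

print(np.isclose(d, cosine(v, w)))  # True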

Probably even faster is to use scipy.spatial.distance.cdist(XA, XB, 'cosine'). You need to build a matrix from the sets of vectors (pseudo-code):

XA=np.array([vecA1,vecA2,...,vecA400])
XB=np.array([vecB1,vecB2,...,vecB400])
distances = scipy.spatial.distance.cdist(XA, XB, 'cosine')
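
Since the original goal is the top 5 matches in B for each vector in A, here is a possible follow-up sketch (my addition; the set sizes 1000 and 5000 are arbitrary assumptions) that pulls the five smallest distances out of each row of the cdist result with np.argpartition:

import numpy as np
from scipy.spatial.distance import cdist

# Example data shaped like the question: row-wise 400-dimensional vectors.
rng = np.random.default_rng(0)
XA = rng.uniform(-10, 10, (1000, 400))   # set A (size is an arbitrary choice)
XB = rng.uniform(-10, 10, (5000, 400))   # set B (size is an arbitrary choice)

distances = cdist(XA, XB, 'cosine')      # distances[i, j] = cosine distance A[i] vs B[j]

# argpartition places the 5 smallest entries of each row first (in arbitrary order),
# which is cheaper than a full sort; a final argsort over just those 5 orders them.
top5 = np.argpartition(distances, 5, axis=1)[:, :5]
rows = np.arange(distances.shape[0])[:, None]
top5 = top5[rows, np.argsort(distances[rows, top5], axis=1)]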

This is a NAIVE no-loop, no-overhead(?) implementation of what you need...

import numpy as np
from numpy.linalg import norm
# Normalise each row of A and B to unit length; 1 - (normalised A) @ (normalised B).T gives the cosine distances.
res = 1 - np.dot(A/norm(A, axis=1)[...,None], (B/norm(B, axis=1)[...,None]).T)

Could you please benchmark it on a subset of your data and let us know if it's faster than scipy's cosine distance?


PS: axis=1 above is based on the assumption that your vectors are stored row-wise, i.e.

print A
# [[1 2 3 4 5 6 7 8 ... 400]
#  [2 3 4 5 6 7 8 9 ... 401]

etc.
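
Side note (my addition, not from the original answer): numpy.linalg.norm also takes a keepdims argument, which gives the same column shape for broadcasting without the [..., None] indexing:

import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(0)
A = rng.uniform(-10, 10, (100, 400))  # row-wise vectors, as assumed above
B = rng.uniform(-10, 10, (100, 400))

# keepdims=True keeps the row norms as an (n, 1) column, so the division
# broadcasts row by row exactly like norm(A, axis=1)[..., None].
res = 1 - np.dot(A / norm(A, axis=1, keepdims=True),
                 (B / norm(B, axis=1, keepdims=True)).T)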


Comments

In [79]: A = np.random.random((2,5))

In [80]: A
Out[80]: 
array([[ 0.2917865 ,  0.89617367,  0.27118045,  0.58596817,  0.05154168],
       [ 0.61131638,  0.2859271 ,  0.09411264,  0.57995386,  0.09829525]])

In [81]: norm(A,axis=1)
Out[81]: array([ 1.14359988,  0.90018201])

In [82]: norm(A,axis=1)[...,None]
Out[82]: 
array([[ 1.14359988],
       [ 0.90018201]])

In [83]: A/norm(A,axis=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-83-707fa10dc673> in <module>()
----> 1 A/norm(A,axis=1)

ValueError: operands could not be broadcast together with shapes (2,5) (2,) 

In [84]: A/norm(A,axis=1)[...,None]
Out[84]: 
array([[ 0.25514737,  0.78364267,  0.23712878,  0.51238915,  0.04506968],
       [ 0.67910309,  0.31763254,  0.10454846,  0.64426289,  0.10919486]])

In [85]: norm(A/norm(A,axis=1)[...,None], axis=1)
Out[85]: array([ 1.,  1.])


The session above explains the normalisation procedure. Once we have the normalised matrices A' and B', we take their dot product (transposing B', of course); the result is a matrix whose element i, j is the dot product of the NORMALISED vectors A_i and B_j. Subtracting this matrix from 1 gives the matrix of cosine distances. Or so I hope...

Test & Benchmark

In [1]: import numpy as np                                              

In [2]: from numpy.linalg import norm as n

In [3]: from scipy.spatial.distance import cosine

In [4]: A = np.random.random((100,400))

In [5]: B = np.random.random((100,400))

In [6]: C = np.array([[cosine(a,b) for b in B] for a in A])

In [7]: c = 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)

In [8]: np.max(C-c)
Out[8]: 8.8817841970012523e-16

In [9]: np.min(C-c)
Out[9]: -8.8817841970012523e-16

In [10]: %timeit [[cosine(a,b) for b in B] for a in A];
1 loops, best of 3: 1.3 s per loop

In [11]: %timeit 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)
100 loops, best of 3: 9.28 ms per loop

