Vectorized cosine similarity calculation in Python
I have two large sets of vectors, A and B. Each element of A is a 1-dimensional vector of length 400, with float values between -10 and 10. For each vector in A, I'm trying to calculate the cosine similarities to all vectors in B in order to find the top 5 vectors in B that best match the given A vector.

For now I'm looping over all of A, and looping over all of B, calculating the cosine similarities one-by-one with SciPy's spatial.distance.cosine(a, b).

Is there a faster way to do this? Perhaps with matrices?
You can first transform each vector into its unit vector (by dividing it by its length). Then the distance formula simplifies to

d = 1 - e_v * e_w

with e_v = v / ||v||_2 , e_w = w / ||w||_2

which is faster to calculate.
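As a quick check, here is a minimal sketch (using an arbitrary pair of example vectors) confirming that the simplified formula agrees with SciPy's cosine distance:

```python
import numpy as np
from numpy.linalg import norm
from scipy.spatial.distance import cosine

# Two arbitrary example vectors (any non-zero vectors work)
v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

# Normalise each to a unit vector; the distance is then 1 minus the dot product
e_v = v / norm(v)
e_w = w / norm(w)
d_simplified = 1.0 - np.dot(e_v, e_w)

# Matches SciPy's definition up to floating-point error
assert abs(d_simplified - cosine(v, w)) < 1e-12
```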
Probably faster still is to use scipy.spatial.distance.cdist(XA, XB, 'cosine'). You need to build a matrix from each set of vectors (pseudo-code):

XA = np.array([vecA1, vecA2, ...])
XB = np.array([vecB1, vecB2, ...])
distances = scipy.spatial.distance.cdist(XA, XB, 'cosine')
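Since the goal is the top 5 matches per A vector, the distance matrix from cdist can be fed straight into argsort. A sketch with made-up set sizes (only the vector length of 400 comes from the question; the 50 and 80 are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
A = rng.uniform(-10, 10, size=(50, 400))   # 50 vectors: arbitrary example size
B = rng.uniform(-10, 10, size=(80, 400))   # 80 vectors: arbitrary example size

# One call computes all pairwise cosine distances: shape (50, 80)
distances = cdist(A, B, 'cosine')

# Smallest distance = best match; take the first 5 columns after sorting each row
top5 = np.argsort(distances, axis=1)[:, :5]
print(top5.shape)  # (50, 5)
```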
This is a NAIVE no-loop, no-overhead(?) implementation of what you need...

from numpy.linalg import norm
res = 1 - np.dot(A/norm(A, axis=1)[...,None], (B/norm(B, axis=1)[...,None]).T)

Could you please benchmark it on a subset of your data and let us know if it's faster than scipy's cosine distance?

ps, axis=1 above is based on the assumption that your vectors are stored row-wise, i.e.

print A
# [[1 2 3 4 5 6 7 8 ... 400]
#  [2 3 4 5 6 7 8 9 ... 401]

etc
In [79]: A = np.random.random((2,5))
In [80]: A
Out[80]:
array([[ 0.2917865 , 0.89617367, 0.27118045, 0.58596817, 0.05154168],
[ 0.61131638, 0.2859271 , 0.09411264, 0.57995386, 0.09829525]])
In [81]: norm(A,axis=1)
Out[81]: array([ 1.14359988, 0.90018201])
In [82]: norm(A,axis=1)[...,None]
Out[82]:
array([[ 1.14359988],
[ 0.90018201]])
In [83]: A/norm(A,axis=1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-83-707fa10dc673> in <module>()
----> 1 A/norm(A,axis=1)
ValueError: operands could not be broadcast together with shapes (2,5) (2,)
In [84]: A/norm(A,axis=1)[...,None]
Out[84]:
array([[ 0.25514737, 0.78364267, 0.23712878, 0.51238915, 0.04506968],
[ 0.67910309, 0.31763254, 0.10454846, 0.64426289, 0.10919486]])
In [85]: norm(A/norm(A,axis=1)[...,None], axis=1)
Out[85]: array([ 1., 1.])
The session above is for explaining the normalisation procedure. When we have the normalised matrices A' and B' we take the dot product (we have to transpose the B' matrix, of course), and the result is a matrix whose element i, j is the dot product of the NORMALISED vectors A_i and B_j; we subtract this matrix from 1 and we have a matrix of cosine distances. Or so I hope...
In [1]: import numpy as np
In [2]: from numpy.linalg import norm as n
In [3]: from scipy.spatial.distance import cosine
In [4]: A = np.random.random((100,400))
In [5]: B = np.random.random((100,400))
In [6]: C = np.array([[cosine(a,b) for b in B] for a in A])
In [7]: c = 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)
In [8]: np.max(C-c)
Out[8]: 8.8817841970012523e-16
In [9]: np.min(C-c)
Out[9]: -8.8817841970012523e-16
In [10]: %timeit [[cosine(a,b) for b in B] for a in A];
1 loops, best of 3: 1.3 s per loop
In [11]: %timeit 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)
100 loops, best of 3: 9.28 ms per loop
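Once you have the full distance matrix c (from cdist or the dot-product one-liner above), the top-5 indices per row can be pulled out without fully sorting each row, e.g. with np.argpartition; a sketch:

```python
import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(42)
A = rng.random((100, 400))
B = rng.random((100, 400))

# Cosine distance matrix via the normalised dot product
c = 1.0 - np.dot(A / norm(A, axis=1)[:, None], (B / norm(B, axis=1)[:, None]).T)

# argpartition moves the 5 smallest distances to the front of each row in O(n);
# a small argsort then orders just those 5
idx = np.argpartition(c, 5, axis=1)[:, :5]
order = np.argsort(np.take_along_axis(c, idx, axis=1), axis=1)
top5 = np.take_along_axis(idx, order, axis=1)
print(top5.shape)  # (100, 5)
```

For the sizes in this thread a full argsort is cheap too, but argpartition scales better when B grows large.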