Create a function to compute all pairwise cosine similarities of the row vectors in a 2-D matrix using only numpy

For example, given the matrix

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

it should return

array([[1.        , 0.91465912, 0.87845859],
       [0.91465912, 1.        , 0.99663684],
       [0.87845859, 0.99663684, 1.        ]])

where the (i, j) entry of the result is the cosine similarity between the row vectors arr[i] and arr[j]: cos_sim[i, j] == CosSim(arr[i], arr[j]).

As usual, the cosine similarity between two vectors p and q is defined as:

$$\mathrm{CosSim}(p, q) = \frac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert}$$

The function should return an np.ndarray of shape (arr.shape[0], arr.shape[0]).
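For reference, a single entry of the expected result can be checked directly from the definition. This is only a minimal sketch: `cos_sim_pair` is a hypothetical helper name, not something from the question, and it handles one pair of vectors rather than the whole matrix.

```python
import numpy as np

def cos_sim_pair(u, v):
    # Hypothetical helper: cosine similarity of two 1-D vectors,
    # i.e. the dot product divided by the product of the Euclidean norms.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

arr = np.arange(15).reshape(3, 5)

# Entry (0, 1) of the expected result: 80 / sqrt(30 * 255)
print(cos_sim_pair(arr[0], arr[1]))  # approximately 0.91465912
```

Looping this helper over every pair of rows would give the full matrix, but a vectorized version avoids the Python-level loop.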

Try:

from scipy.spatial.distance import cdist

1 - cdist(a, a, metric='cosine')

Output:

array([[1.        , 0.91465912, 0.87845859],
       [0.91465912, 1.        , 0.99663684],
       [0.87845859, 0.99663684, 1.        ]])
Here a is the example matrix:

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

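Using the second form of the formula below, for two column vectors p and q (normalize each vector first, then take the dot product):

$$\mathrm{CosSim}(p, q) = \frac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert} = \left(\frac{p}{\lVert p \rVert}\right)^{T} \frac{q}{\lVert q \rVert}$$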

p = a / np.linalg.norm(a, 2, axis=1).reshape(-1,1)
p
array([[0.        , 0.18257419, 0.36514837, 0.54772256, 0.73029674],
       [0.31311215, 0.37573457, 0.438357  , 0.50097943, 0.56360186],
       [0.37011661, 0.40712827, 0.44413993, 0.48115159, 0.51816325]])

Note that the norm has to be calculated row-wise, which is why axis=1 is used above. The norms come back as a 1-D array, of shape (3,) in this case, so a reshape to (3, 1) is required for the division to scale each row by its own norm. Also, the formula above is written for vectors; when you apply it to a matrix whose rows are the vectors, the transpose moves to the second factor, giving p times p.T.
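The shape point can be seen directly. The sketch below uses `keepdims=True`, a parameter of `np.linalg.norm`, as an alternative to the reshape; the variable names `norms` and `norms_col` are only illustrative. Dividing `a` by the 1-D `norms` would fail here, because shapes (3, 5) and (3,) do not broadcast.

```python
import numpy as np

a = np.arange(15).reshape(3, 5)

norms = np.linalg.norm(a, axis=1)                     # 1-D, shape (3,)
norms_col = np.linalg.norm(a, axis=1, keepdims=True)  # column, shape (3, 1)

# Dividing by the (3, 1) column scales each row by its own norm.
p = a / norms_col

print(norms.shape, norms_col.shape)  # (3,) (3, 1)
print(np.linalg.norm(p, axis=1))     # every row now has length 1 (up to rounding)
```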

Now in this case, q is nothing but p itself. So, the cosine similarity would be

np.dot(p, p.T)
array([[1.        , 0.91465912, 0.87845859],
       [0.91465912, 1.        , 0.99663684],
       [0.87845859, 0.99663684, 1.        ]])
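Putting the two steps together gives a function of the kind the question asks for, using only numpy. This is a sketch under stated assumptions: the name `pairwise_cosine_similarity` is made up for illustration, and the code assumes no row is all zeros, since a zero row has norm 0 and the division would produce nan values.

```python
import numpy as np

def pairwise_cosine_similarity(arr):
    # Hypothetical name; wraps the normalize-then-multiply steps in one function.
    arr = np.asarray(arr, dtype=float)
    # Row-wise Euclidean norms, kept as a column so the division broadcasts per row.
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    # Assumption: no row is all zeros (a zero norm would give nan entries).
    p = arr / norms
    # Entry (i, j) is the dot product of unit rows i and j.
    return p @ p.T

arr = np.arange(15).reshape(3, 5)
cos_sim = pairwise_cosine_similarity(arr)
print(cos_sim.shape)  # (3, 3)
print(cos_sim)
```

The result has shape (arr.shape[0], arr.shape[0]) and is symmetric, with ones on the diagonal up to floating-point rounding.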
