
Adjusted Cosine Similarity in Python

Referring to this link

which calculates the adjusted cosine similarity matrix (given a ratings matrix M with m users and n items) as follows:

M_u = M.mean(axis=1)    
item_mean_subtracted = M - M_u[:, None]    
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))

I cannot see how the 'both rated' condition is met, as per this definition.

I have manually calculated the adjusted cosine similarities, and they seem to differ from the values I get from the above code.

Could anyone please clarify this?

Let's first try to understand the formulation: the matrix is stored such that each row is a user and each column is an item. Users are indexed by u and items by i.

Each user has a different standard for judging how good or how bad something is. A 1 from one user could be a 3 from another. That is why we subtract the mean of each user's ratings, R_u, from each R_{u,i}. This is computed as item_mean_subtracted in your code. Notice that we subtract each element's row mean to normalize away the user's bias. After that, each column (item) is normalized by dividing it by its norm, and the cosine similarity between each pair of columns is computed.
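The row-mean subtraction relies on NumPy broadcasting; this minimal sketch (the two-user matrix `R` here is made up purely to make the centering visible) shows how it removes per-user bias:

```python
import numpy as np

# Hypothetical 2-user, 3-item ratings matrix, for illustration only.
R = np.array([[1.0, 2.0, 3.0],
              [3.0, 4.0, 5.0]])

# Subtract each user's (row) mean to remove that user's rating bias.
R_u = R.mean(axis=1)          # per-user means: [2., 4.]
centered = R - R_u[:, None]   # [:, None] makes a column vector that broadcasts across items

print(centered)
# Both users now express identical relative preferences:
# [[-1.  0.  1.]
#  [-1.  0.  1.]]
```

Note that the second user rated everything two points higher, yet after centering the two rows are identical, which is exactly the effect the answer describes.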

pdist(item_mean_subtracted.T, 'cosine') computes the cosine distance between the items, and it is known that

cosine similarity = 1 - cosine distance

and hence that is why the code works.
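This identity can be checked on a small pair of item vectors (`a` and `b` below are hypothetical, chosen only for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import norm

# Two hypothetical item vectors (already mean-centered), for illustration.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.0])

# Cosine similarity computed manually: dot product of the unit vectors.
manual = np.dot(a, b) / (norm(a) * norm(b))

# pdist returns the cosine *distance*; subtracting from 1 recovers similarity.
X = np.column_stack([a, b])                        # items as columns, as in the answer
sim = 1 - squareform(pdist(X.T, 'cosine'))[0, 1]

print(np.isclose(manual, sim))  # True
```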

Now, what if I compute it directly according to the definition? I have commented what is being performed in each step; try copying and pasting the code, and you can compare with your calculation by printing out more intermediate steps.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import norm

M = np.asarray([[2, 3, 4, 1, 0], 
                [0, 0, 0, 0, 5], 
                [5, 4, 3, 0, 0], 
                [1, 1, 1, 1, 1]])

M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
print(similarity_matrix)

# Computing the cosine similarity directly from the definition
n = M.shape[1]  # number of columns (items)
# Divide each column by its norm so that every item vector has unit length
normalized = item_mean_subtracted / norm(item_mean_subtracted, axis=0).reshape(1, n)
normalized = normalized.T  # rows are now items
# Similarity matrix: inner product of every pair of unit item vectors
similarity_matrix2 = np.asarray([[np.inner(normalized[i], normalized[j]) for i in range(n)] for j in range(n)])
print(similarity_matrix2)

Both pieces of code give the same result:

[[ 1.          0.86743396  0.39694169 -0.67525773 -0.72426278]
 [ 0.86743396  1.          0.80099604 -0.64553225 -0.90790362]
 [ 0.39694169  0.80099604  1.         -0.37833504 -0.80337196]
 [-0.67525773 -0.64553225 -0.37833504  1.          0.26594024]
 [-0.72426278 -0.90790362 -0.80337196  0.26594024  1.        ]]
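As a quick check, the agreement can also be verified programmatically. This sketch reruns both computations on the same `M`, replacing the explicit double loop with an equivalent matrix product (`normalized.T @ normalized`), and confirms the two matrices match:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import norm

M = np.asarray([[2, 3, 4, 1, 0],
                [0, 0, 0, 0, 5],
                [5, 4, 3, 0, 0],
                [1, 1, 1, 1, 1]])

# Center each user's ratings by their row mean.
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]

# Vectorized version via pdist (cosine distance -> similarity).
sim_pdist = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))

# Direct definition: normalize each item column, then take all pairwise
# inner products at once with a single matrix product.
normalized = item_mean_subtracted / norm(item_mean_subtracted, axis=0)
sim_direct = normalized.T @ normalized

print(np.allclose(sim_pdist, sim_direct))  # True
```

The matrix product form is numerically identical to the nested-loop version but avoids the Python-level loops.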
