Find highest Cosine Similarity in R

I have computed the cosine similarity of tweets, which I have already put in my_matrix. Now I want to get the highest similarity scores.

cos = cosine(my_matrix)
cos

cos gives me a matrix with all the values in it. The output looks like this:

           1         2         3         4         5         6         7         8
1  1.0000000  0.5568073  0.3901539  0.5621206  0.2816833  0.2160066  0.2605051  0.2115766
2  0.5568073  1.0000000  0.6526458  0.7140950  0.4307470  0.3033117  0.2941557  0.3437280
3  0.3901539  0.6526458  1.0000000  0.5650099  0.3252116  0.2494666  0.2453746  0.3903765
4  0.5621206  0.7140950  0.5650099  1.0000000  0.4033797  0.2911018  0.3459270  0.3239339
5  0.2816833  0.4307470  0.3252116  0.4033797  1.0000000  0.2501818  0.1925585  0.1905618
6  0.2160066  0.3033117  0.2494666  0.2911018  0.2501818  1.0000000  0.1378479  0.2054312
7  0.2605051  0.2941557  0.2453746  0.3459270  0.1925585  0.1378479  1.0000000  0.1320529
8  0.2115766  0.3437280  0.3903765  0.3239339  0.1905618  0.2054312  0.1320529  1.0000000
9  0.4836184  0.6940823  0.5820808  0.7131646  0.4122365  0.2808218  0.3132991  0.3311042
10 0.3097645  0.3486836  0.2695222  0.3268555  0.1954665  0.1239200  0.1436308  0.1333930

Now I want to iterate through this matrix and get the highest value, excluding the 1s on the diagonal (row 1, column 1 = 1; row 2, column 2 = 1; and so on).

The output I want to get in this example is 0.7140950 in row 4 and column 2, as it is the largest value below 1. So far I have tried a double for-loop to iterate over the rows and columns, but this doesn't work at all and I don't know how to go on.

biggest_value = 0 

for(row in 1:nrow(party_m)) {
  for(col in 1:ncol(party_m)) {
        if(my_matrix[row, col] > biggest_value ){
           biggest_value = my_matriy[row,col]
        }
  }
}

Does anybody have a solution for this?

diag(cos) <- 0

which(cos == max(cos), arr.ind = TRUE)

Note that since your matrix is symmetric, you'll get several max values, e.g. row 4, column 2 and row 2, column 4.

You can set the upper triangle to missing first to prevent this:

cos[upper.tri(cos, diag = TRUE)] <- NA

and then use the which function, calling max with na.rm = TRUE so that the NA entries are ignored.
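Putting the two steps together, here is a minimal sketch on a small made-up symmetric matrix. The name sim and its values are illustrative only, not the asker's data.

```r
# Toy symmetric similarity matrix (illustrative values, not the original data)
sim <- matrix(c(1.0, 0.5, 0.3,
                0.5, 1.0, 0.7,
                0.3, 0.7, 1.0), nrow = 3, byrow = TRUE)

# Blank out the diagonal and the upper triangle so each pair appears only once
sim[upper.tri(sim, diag = TRUE)] <- NA

# max() has to skip the NA entries; which() already ignores NA comparisons
best <- max(sim, na.rm = TRUE)
pos <- which(sim == best, arr.ind = TRUE)
```

With this toy input, pos should hold a single row pointing at row 3, column 2, rather than the pair of mirrored positions.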

It's possible your code doesn't work because you have a typo: biggest_value = my_matriy[row,col] instead of biggest_value = my_matrix[row,col], although I haven't run it to find out.
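For reference, a loop-based version could look roughly like the sketch below. It assumes the matrix being scanned is the similarity matrix itself, uses one matrix name throughout, and skips the diagonal. The toy values and the helper variables biggest_row and biggest_col are illustrative, not from the question.

```r
# Toy symmetric similarity matrix (illustrative values, not the original data)
my_matrix <- matrix(c(1.0, 0.5, 0.3,
                      0.5, 1.0, 0.7,
                      0.3, 0.7, 1.0), nrow = 3, byrow = TRUE)

biggest_value <- -Inf
biggest_row <- NA
biggest_col <- NA

for (row in seq_len(nrow(my_matrix))) {
  for (col in seq_len(ncol(my_matrix))) {
    if (row == col) next  # skip the diagonal, which is always 1
    if (my_matrix[row, col] > biggest_value) {
      biggest_value <- my_matrix[row, col]
      biggest_row <- row
      biggest_col <- col
    }
  }
}
```

Because the comparison is strict, the loop keeps the first position at which the maximum occurs. The vectorised which approach is shorter and reports every tied position.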

As noted in the comments, you can set the diagonal elements of the matrix to 0 and then determine the maximum value in the matrix. You don't have any negative values, but in general you may prefer to get the maximum absolute value instead (or as well) if the strongest association is desired. To find which pair yields those values, use which (see ?which). Consider:

diag(cos) <- 0 
max(cos)
# [1] 0.714095
which(cos==max(cos), arr.ind=TRUE) 
#      row col
# [1,]   4   2
# [2,]   2   4
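If negative similarities were possible, the absolute-value variant mentioned above might be sketched as follows. The matrix sim and its values are made up for illustration, and hits is just a local name.

```r
# Toy symmetric matrix with one negative association (illustrative values only)
sim <- matrix(c( 1.0, -0.8, 0.3,
                -0.8,  1.0, 0.7,
                 0.3,  0.7, 1.0), nrow = 3, byrow = TRUE)

# Remove the trivial self-similarities
diag(sim) <- 0

# Strongest association regardless of sign
strongest <- max(abs(sim))
hits <- which(abs(sim) == strongest, arr.ind = TRUE)
```

Here the strongest pair is rows 1 and 2 (magnitude 0.8), even though 0.7 is the largest signed value. As before, both mirrored positions are returned because the matrix is symmetric.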
