
Why can cosine similarity between two vectors be negative?

I have two vectors with 11 dimensions.

a <- c(-0.012813841, -0.024518383, -0.002765056,  0.079496744,  0.063928973,
        0.476156960,  0.122111977,  0.322930189,  0.400701256,  0.454048860,
        0.525526219)

b <- c(0.64175768,  0.54625694,  0.40728261,  0.24819750,  0.09406221, 
       0.16681692, -0.04211932, -0.07130129, -0.08182200, -0.08266852,
       -0.07215885)

library(lsa)  # provides cosine()
cosine_sim <- cosine(a, b)

which returns:

-0.05397935

I used cosine() from the lsa package.

For some inputs, such as the pair above, I get a negative cosine_sim. I am not sure how a similarity can be negative; I expected it to be between 0 and 1.

Can anyone explain what is going on here?

The nice thing about R is that you can often dig into a function and see for yourself what is going on. If you type cosine (without any parentheses, arguments, etc.), R prints out the body of the function. Poking through it (which takes some practice), you can see that there is a bunch of machinery for computing the pairwise similarities of the columns of a matrix (i.e., the part wrapped in the if (is.matrix(x) && is.null(y)) condition), but the key line of the function is

crossprod(x, y)/sqrt(crossprod(x) * crossprod(y))

Let's pull this out and apply it to your example:

> crossprod(a,b)/sqrt(crossprod(a)*crossprod(b))
            [,1]
[1,] -0.05397935
> crossprod(a)
     [,1]
[1,]    1
> crossprod(b)
     [,1]
[1,]    1

So you're using vectors that are already normalized (each has unit length), which means the denominator is 1 and you only have crossprod(a, b) to look at. In your case this is equivalent to

> sum(a*b)
[1] -0.05397935

(For real matrix operations, crossprod is much more efficient than constructing the equivalent operation by hand.)

As @Jack Maney's answer says, the dot product of two vectors (which is |a| * |b| * cos(theta), where |a| and |b| are the Euclidean lengths of the vectors and theta is the angle between them) can be negative: the cosine is negative whenever the angle exceeds 90 degrees.
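To make the geometric reading concrete, here is a minimal base-R sketch that recovers the angle between the two vectors from the question. The variable names norm_a, norm_b, cos_theta and theta_deg are only illustrative.

```r
a <- c(-0.012813841, -0.024518383, -0.002765056,  0.079496744,  0.063928973,
        0.476156960,  0.122111977,  0.322930189,  0.400701256,  0.454048860,
        0.525526219)
b <- c(0.64175768,  0.54625694,  0.40728261,  0.24819750,  0.09406221,
       0.16681692, -0.04211932, -0.07130129, -0.08182200, -0.08266852,
       -0.07215885)

# Euclidean lengths of the two vectors
norm_a <- sqrt(sum(a^2))
norm_b <- sqrt(sum(b^2))

# Cosine of the angle: dot product divided by the product of the lengths
cos_theta <- sum(a * b) / (norm_a * norm_b)

# Convert to degrees; a negative cosine corresponds to an angle above 90
theta_deg <- acos(cos_theta) * 180 / pi

cos_theta
theta_deg
```

The angle comes out slightly above 90 degrees, which is exactly the situation in which the cosine dips below zero.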

For what it's worth, I suspect that the cosine function in lsa might be more easily/efficiently implemented for matrix arguments as as.dist(crossprod(x)) ...
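A rough sketch of that idea, using base R only and not taken from the lsa source: crossprod(x) yields pairwise cosines only if the columns of x already have unit length, so the sketch normalizes them first. The matrix x here is made-up random data, and x_unit and sim are illustrative names.

```r
set.seed(1)
x <- matrix(rnorm(12), nrow = 4)   # 4 dimensions, 3 column vectors (made-up data)

# Scale each column to unit Euclidean length
x_unit <- sweep(x, 2, sqrt(colSums(x^2)), "/")

# For unit-length columns, crossprod gives all pairwise cosines at once
sim <- crossprod(x_unit)

# as.dist keeps only the lower triangle (one value per pair of columns)
as.dist(sim)
```

Note that as.dist labels the result as a distance object even though the entries here are similarities, so it may be clearer to keep the full matrix sim, depending on what is done with it next.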

Edit: in comments on a now-deleted answer below, I suggested that the square of the cosine measure might be appropriate if one wants a similarity measure on [0,1]. This would be analogous to using the coefficient of determination (r^2) rather than the correlation coefficient (r). It might also be worth going back and thinking more carefully about the purpose and meaning of the similarity measure to be used.
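As a small illustration of the squared measure, here is a sketch with a hypothetical helper, cosine_sq, which is not part of lsa:

```r
# Hypothetical helper: squared cosine of the angle between x and y
cosine_sq <- function(x, y) {
  cs <- sum(x * y) / sqrt(sum(x^2) * sum(y^2))
  cs^2
}

cosine_sq(c(1, 0), c(0, 1))    # orthogonal vectors: 0
cosine_sq(c(1, 0), c(-1, 0))   # opposite directions: 1
```

Squaring discards the sign, so vectors pointing in exactly opposite directions score as maximally similar. Whether that is acceptable depends on what the similarity is meant to capture.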

The cosine function returns

crossprod(a, b)/sqrt(crossprod(a) * crossprod(b))

In this case, both terms in the denominator are 1, but crossprod(a, b) is about -0.05.

The cosine function can take on negative values.

While the cosine of two vectors can take any value between -1 and +1, cosine similarity (in document retrieval) usually takes values in the [0,1] interval. The reason is simple: the word-by-document matrix contains no negative values, so the maximum angle between two vectors is 90 degrees, for which the cosine is 0.
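A minimal sketch of the non-negative case, with made-up term counts and a hypothetical helper cos_sim:

```r
# Hypothetical helper: cosine of the angle between x and y
cos_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Made-up term counts for three documents over the same five terms
doc1 <- c(2, 0, 1, 0, 3)
doc2 <- c(0, 4, 0, 1, 0)
doc3 <- c(1, 1, 0, 0, 2)

cos_sim(doc1, doc2)   # no shared terms: 0
cos_sim(doc1, doc3)   # shared terms: strictly between 0 and 1
```

Because every product in the dot product is non-negative, the result cannot fall below 0. Vectors with negative entries, like a and b in the question, do not have this guarantee.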
