
How is cosine similarity used with the K-means algorithm?

For three text document vectors of different lengths in the VSM, where the entries are the tf-idf weights of terms:

Q1: How is cosine similarity used by k-means, and how are the clusters then constructed?

Q2: When I use the TF-IDF algorithm, it produces negative values. Is there any problem in my calculation?

Please use the following document vectors in the VSM (tf-idf), all of which have different vector lengths, for explanation purposes.

Doc1 (0.134636045, -0.000281926, -0.000281926, -0.000281926, -0.000281926, 0)
Doc2 (-0.002354898, 0.012411358, 0.012411358, 0.09621575, 0.3815553)
Doc3 (-0.001838258, 0.009688438, 0.019376876, 0.05633028, 0.59569238, 0.103366223, 0)

I will thank anyone who can give an explanation of my question.

Cosine similarity means you take the dot product of the vector and the k-means centre, rather than the Euclidean distance.

The dot product is a_x*b_x + a_y*b_y + ... summed over all the dimensions. You generally normalize the vectors first. Then call acos() on the result to get the angle.
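As a minimal sketch of that computation (using numpy; the function name and example vectors are illustrative, not taken from the question), and assuming both vectors have the same number of dimensions:

    import numpy as np

    def cosine_similarity(a, b):
        # Normalize each vector to unit length, then take the dot product.
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        return float(np.dot(a, b))

    # Hypothetical tf-idf vectors; both must have the same number of dimensions.
    sim = cosine_similarity(np.array([0.13, 0.0, 0.02, 0.5]),
                            np.array([0.0, 0.01, 0.02, 0.6]))
    angle = np.arccos(np.clip(sim, -1.0, 1.0))   # the acos() step above, in radians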

Essentially you're dividing the results into sectors rather than into randomly-clumped clusters.
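Below is a rough sketch of how a cosine-based ("spherical") k-means loop could look, assuming all tf-idf vectors share the same dimensionality; the names spherical_kmeans, iters and so on are illustrative, not from the answer:

    import numpy as np

    def spherical_kmeans(docs, k, iters=20, seed=0):
        # docs: (n_docs, n_terms) tf-idf matrix with rows of equal length.
        rng = np.random.default_rng(seed)
        X = docs / np.linalg.norm(docs, axis=1, keepdims=True)   # unit-length rows
        centres = X[rng.choice(len(X), size=k, replace=False)]   # k documents as initial centres
        for _ in range(iters):
            sims = X @ centres.T              # cosine similarity of every doc to every centre
            labels = sims.argmax(axis=1)      # assign each doc to its most similar centre
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    c = members.mean(axis=0)
                    centres[j] = c / np.linalg.norm(c)   # re-normalize the updated centre
        return labels, centres

Each document ends up in the cluster whose centre points in the most similar direction, which is the "sector" picture described above.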


 