
How is cosine similarity used with the K-means algorithm?

For three text document vectors of different lengths in the VSM, where the entries are the tf-idf weights of terms:

Q1: How is cosine similarity used by k-means, and how are the clusters then constructed?

Q2: When I use the TF-IDF algorithm, it produces negative values. Is there any problem in my calculation?

Please use the following document vectors in the VSM (tf-idf), all of which have different vector lengths, for explanation purposes.

Doc1 (0.134636045, -0.000281926, -0.000281926, -0.000281926, -0.000281926, 0)
Doc2 (-0.002354898, 0.012411358, 0.012411358, 0.09621575, 0.3815553)
Doc3 (-0.001838258, 0.009688438, 0.019376876, 0.05633028, 0.59569238, 0.103366223, 0)

I will thank anyone who can give an explanation of my question.

Cosine similarity means you take the dot product of the vector and the k-means centre, rather than the Euclidean distance.

The dot product is a_x*b_x + a_y*b_y + ... summed over all the dimensions. You generally normalize the vectors first. Then call acos() on the result to get the angle.
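As a minimal sketch of that computation (using numpy; the function name and example vectors are illustrative, not taken from the question), and assuming both vectors have the same number of dimensions:

    import numpy as np

    def cosine_similarity(a, b):
        # Normalize each vector to unit length, then take the dot product.
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        return float(np.dot(a, b))

    # Hypothetical tf-idf vectors; both must have the same number of dimensions.
    sim = cosine_similarity(np.array([0.13, 0.0, 0.02, 0.5]),
                            np.array([0.0, 0.01, 0.02, 0.6]))
    angle = np.arccos(np.clip(sim, -1.0, 1.0))   # the acos() step above, in radians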

Essentially you're dividing the results into sectors rather than into randomly-clumped clusters.
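Below is a rough sketch of how a cosine-based ("spherical") k-means loop could look, assuming all tf-idf vectors share the same dimensionality; the names spherical_kmeans, iters and so on are illustrative, not from the answer:

    import numpy as np

    def spherical_kmeans(docs, k, iters=20, seed=0):
        # docs: (n_docs, n_terms) tf-idf matrix with rows of equal length.
        rng = np.random.default_rng(seed)
        X = docs / np.linalg.norm(docs, axis=1, keepdims=True)   # unit-length rows
        centres = X[rng.choice(len(X), size=k, replace=False)]   # k documents as initial centres
        for _ in range(iters):
            sims = X @ centres.T              # cosine similarity of every doc to every centre
            labels = sims.argmax(axis=1)      # assign each doc to its most similar centre
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    c = members.mean(axis=0)
                    centres[j] = c / np.linalg.norm(c)   # re-normalize the updated centre
        return labels, centres

Each document ends up in the cluster whose centre points in the most similar direction, which is the "sector" picture described above.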


 