
Calculating similarity between, and the centroid of, Lucene documents

Original 2010-08-10 08:24:02 6 3 java/ lucene/ cluster-analysis/ similarity/ tf-idf

To run a simple clustering algorithm on the results I get from Lucene, I need to calculate the cosine similarity between two Lucene documents. I also need to be able to build a centroid document that represents the centroid of each cluster.

All I can think of doing is building my own vector space model with tf-idf weighting, using the TermFreqVectors and the overall term frequencies to populate it.

My question is: this is not an efficient approach, so is there a better way to do this?

This feels a little unclear, so any suggestions on how I can improve my question are also appreciated.
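For readers following along, the approach described in the question (tf-idf weighted term vectors compared by cosine similarity) might look roughly like the sketch below. It is plain Java with no Lucene calls. The class name `TfIdfCosine`, its method names, and the `tf * ln(numDocs / df)` weighting are assumptions made for illustration, not something the question specifies. The term counts and document frequencies are assumed to have already been read out of the index.

```java
import java.util.HashMap;
import java.util.Map;

/** Sparse tf-idf vectors and cosine similarity. All names here are hypothetical. */
final class TfIdfCosine {

    /**
     * Builds a tf-idf vector from raw term counts.
     * Assumed weighting: tf * ln(numDocs / df). Terms with no known
     * document frequency are skipped.
     */
    static Map<String, Double> tfIdf(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int numDocs) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df == null || df == 0) {
                continue;
            }
            double idf = Math.log((double) numDocs / df);
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }

    /** Cosine similarity of two sparse vectors; returns 0 if either has zero length. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller map so the dot product touches fewer entries.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```

Keeping the vectors as sparse maps means the cost of a comparison depends on the number of distinct terms in the smaller document, not on the size of the whole vocabulary.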

3 answers

Mark, you may find "Integrating Mahout with Lucene", "IR Math with Java", or "Vector Space Classifier using Lucene" useful.

To get the similarity of one document to another, why not build a single query from the content of one document and run it against the index? That way, you will get a score (cosine similarity values).

The short answer is: No.

I have spent a lot of time (way, way too much) looking into this, and as far as I can see, you can either make your own vector space model and work from that, or use Mahout to generate a Mahout Vector, which you can then use to compare documents. I am going to go ahead and make my own, so I'm marking this question as answered!
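For anyone taking the same "make your own" route, one possible way to build the centroid of a cluster is to average the member documents' term weights component by component. This is only a sketch under that assumption: the class `ClusterCentroid` and its method are hypothetical names, and each document is assumed to be already represented as a sparse map from term to weight.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Centroid of a cluster of sparse term-weight vectors. Names are hypothetical. */
final class ClusterCentroid {

    /**
     * Returns the component-wise mean of the given vectors.
     * A term missing from a document counts as weight 0 for that document.
     * An empty cluster yields an empty vector.
     */
    static Map<String, Double> centroid(List<Map<String, Double>> docs) {
        Map<String, Double> result = new HashMap<>();
        if (docs.isEmpty()) {
            return result;
        }
        // Sum the weights of every term across all documents.
        for (Map<String, Double> doc : docs) {
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                result.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        // Divide each sum by the cluster size to get the mean.
        final double n = docs.size();
        result.replaceAll((term, total) -> total / n);
        return result;
    }
}
```

The resulting map can be treated as a "centroid document" and compared with real documents using the same cosine measure as any other vector.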