
Calculating similarity between and centroid of Lucene documents

To run a simple clustering algorithm on the results I get from Lucene, I need to calculate the cosine similarity between two Lucene documents. I also need to be able to build a centroid document that represents each cluster.
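For reference, the two operations asked about here can be sketched independently of Lucene. This is a minimal sketch, assuming each document has already been reduced to a sparse vector held as a plain dict from term to weight; the function names `cosine_similarity` and `centroid` are illustrative and not part of any Lucene API.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    # Iterate over the smaller vector; only shared terms add to the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: empty vectors match nothing
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    if not vectors:
        raise ValueError("centroid of an empty cluster is undefined")
    sums = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            sums[term] += weight
    n = len(vectors)
    return {term: total / n for term, total in sums.items()}


# Hypothetical term weights for two tiny documents.
doc1 = {"lucene": 2.0, "index": 1.0}
doc2 = {"lucene": 1.0, "search": 1.0}
center = centroid([doc1, doc2])
print(cosine_similarity(doc1, doc2))
print(center)
```

Because the centroid is itself a dict of the same shape, it can be passed straight back into `cosine_similarity` to compare a document against a cluster.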

All I can think of is building my own vector space model with tf-idf weighting, populated from the TermFreqVectors and the overall term frequencies.
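A hand-built model of this kind could look roughly like the sketch below. It assumes the per-document term counts have already been read out of the index (for example from the stored term frequency vectors) into plain dicts. The weighting `tf * (ln(N / df) + 1)` is one common tf-idf variant chosen for illustration, not Lucene's own scoring formula, and `build_tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter


def build_tfidf_vectors(term_freqs):
    """Turn raw term counts into tf-idf weighted sparse vectors.

    term_freqs: list of dicts (term -> raw count), one dict per document.
    """
    num_docs = len(term_freqs)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter()
    for tf in term_freqs:
        doc_freq.update(tf.keys())
    vectors = []
    for tf in term_freqs:
        # Assumed weighting: raw count times (ln(N / df) + 1).
        vectors.append({
            term: count * (math.log(num_docs / doc_freq[term]) + 1.0)
            for term, count in tf.items()
        })
    return vectors


# Hypothetical counts for two documents.
counts = [
    {"lucene": 2, "index": 1},
    {"lucene": 1, "search": 3},
]
vectors = build_tfidf_vectors(counts)
print(vectors)
```

Terms that appear in every document keep a weight equal to their raw count under this variant, while rarer terms are boosted, which is what lets distinctive terms dominate the similarity.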

My question is: this does not seem like an efficient approach. Is there a better way to do this?

This feels a little unclear, so any suggestions on how I can improve my question are also appreciated.

To get the similarity of one document to another, why not build a query from the content of one document and run it against the index? That way you get a score (a cosine-similarity-based value) for each matching document.

The short answer is: No.

I have spent a lot of time (way, way too much) looking into this. As far as I can see, you can either make your own vector space model and work from that, or use Mahout to generate a Mahout Vector, which you can then use to compare documents. I am going to go ahead and make my own, so I'm marking this question as answered!
