
How to calculate similarity between a query and documents?

I have a set of documents, and for each of them I have calculated:

  • Term Frequency (TF) score
  • Inverse Document Frequency (IDF) score
  • TF-IDF score

Now I need to calculate the similarity between a specific query and each document, producing a score that ranks the documents from highest to lowest similarity to the query.

I have searched for a lot of information, but I do not understand the formula.

Source: http://en.wikipedia.org/wiki/Vector_space_model

Can anyone guide me? I just need to know how to proceed from my current progress.

Lucene is an open-source library that does all of this work for you.

Pangea has already given the correct answer: don't reinvent the wheel, especially a complex wheel like document similarity. That being said, understanding how document similarity is computed is an interesting and worthwhile thing to do if you are going to be working in this field. I'll see if I can help a bit.

The basic assumption of the vector space model you have linked is that each document can be represented as a vector in N-dimensional space, where each dimension corresponds to a different word in the universe of documents. A document's value along a given dimension is its weight for that word (for example, its TF-IDF score). In this model, a query can be thought of as a very short document, and thus is also represented as a vector in N-space. The cosine measure is simply the cosine of the angle between the query vector and a given document vector.
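A minimal sketch of that idea, assuming you already have per-word TF-IDF weights (the document names and weights below are made up for illustration): each vector is a word-to-weight dict, the cosine is the dot product divided by the product of the vector lengths, and documents are ranked by their cosine against the query.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (word -> weight dicts)."""
    # Dot product over the words the two vectors have in common.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF weights for three documents and a query.
docs = {
    "d1": {"apple": 0.5, "banana": 0.3},
    "d2": {"apple": 0.1, "cherry": 0.9},
    "d3": {"banana": 0.8},
}
query = {"apple": 0.7, "banana": 0.2}

# Rank documents from highest to lowest similarity to the query.
ranking = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranking)
```

This is the whole ranking step you asked about: compute one cosine per document and sort by it.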

Deriving N-dimensional trigonometry is probably a math course in and of itself, but if you understand the basic idea, you can take the Wikipedia formula on faith for the actual computation (or look it up in a standard text if you prefer). The computational steps (vector dot products and norms) are also well documented individually and are not terribly hard to implement. I'm sure there are also standard library implementations available.

The logic behind the cosine is that, as the similarity between the documents increases, the angle between the two vectors approaches zero (and thus the cosine approaches 1). You can verify this by hand with a universe of two words on the Cartesian plane. All the vector math does is extrapolate the same concept to N dimensions.
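That two-word hand check can be sketched in a few lines (the word counts are arbitrary): two documents that use both words in the same proportion give parallel vectors, while documents sharing no words give perpendicular ones.

```python
import math

def cosine(ax, ay, bx, by):
    # 2-D cosine: dot product divided by the product of the vector lengths.
    return (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))

# Same word proportions (1:2 vs 2:4): angle 0, cosine is 1 (up to float error).
same_direction = cosine(1, 2, 2, 4)
# No words in common: the vectors are perpendicular, cosine is 0.
no_overlap = cosine(1, 0, 0, 1)
print(same_direction, no_overlap)
```

Anything in between lands in (0, 1), which is exactly the similarity score you sort on.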

I hope this clears up some confusion on this interesting topic. For the actual implementation, I once again refer you to Pangea's suggestion to use Lucene.

