
Cosine Similarity with Lucene only for documents that match

As far as I understand, Lucene is an inverted index system; its power lies in the fact that it compares a query only with documents that match at least one of the query's tokens.

Compared to the naive approach, where the query is compared to every document (even those that don't contain any token present in the query), this is a great benefit.

For example, if I have the indexed documents:

D1: "Hello world said the guy"
D2: "Hello, what a beautiful world"
D3: "random text"

As I see it, the search for the query "Hello world" will only look into the indexed documents D1 and D2 and skip D3, which saves time.

Is this correct?
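The intuition above can be sketched in plain Java with a hypothetical postings map (this is an illustration of the idea, not Lucene's actual internals): the index maps each token to the set of documents containing it, so a query only ever touches the postings lists of its own tokens, and D3 is never visited.

```java
import java.util.*;

// Toy inverted index: token -> set of doc IDs containing it.
public class InvertedIndexSketch {
    static Map<String, Set<Integer>> index = new HashMap<>();

    static void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
        }
    }

    // Only the postings of the query's own tokens are consulted;
    // documents sharing no token with the query are never examined.
    static Set<Integer> candidates(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : query.toLowerCase().split("\\W+")) {
            hits.addAll(index.getOrDefault(token, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        add(1, "Hello world said the guy");
        add(2, "Hello, what a beautiful world");
        add(3, "random text");
        System.out.println(candidates("Hello world")); // [1, 2] — D3 is skipped
    }
}
```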

Now, I'm trying to calculate the cosine similarity between documents. The input query will be a document, and the output should be the cosine score, which is a number between 0 and 1.

I've already read some approaches that calculate the cosine similarity, but they all do this by comparing the term vector of every document. For example, this blog mentions the following:

If you really need cosine similarity between documents, you have to enable term vectors for the source fields, and use them to calculate the angle. The problem is that this does not scale well; you would need to calculate angles with virtually all other documents.

and this SO answer seems to say the same:

  1. iterate over all doc ids, 0 to maxDoc();

Isn't there a way to only calculate the cosine similarity for documents that match the query, and have that returned as the document's score?

As a side note, I did read that TFIDFSimilarity comes close; I believe the VSM part is exactly what I need, however that part seems to have disappeared in the Lucene Practical Scoring Function. I'm not sure how I can "transform" this Similarity class to end up with only the pure cosine similarity as a result.

So, a recap of my questions:

  1. Is my perception of how inverted indexes save time correct?

  2. Is there a way to only calculate the cosine similarity for documents that actually match one of the tokens, instead of for all documents?

  3. Can I use/transform the TFIDFSimilarity class to end up with the pure cosine similarity?
  1. It pretty much depends on how you formulate your query. If you formulate a BooleanQuery, you can specify which terms of the query must be present in the returned documents. This is done using BooleanClause.Occur.MUST.
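As a sketch (requires lucene-core on the classpath; the field name "body" is hypothetical), a BooleanQuery with MUST on every clause restricts matching and scoring to documents containing all of the terms, while SHOULD would make any document matching at least one term a candidate:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Builder API as of Lucene 5+. With MUST on both clauses, only documents
// containing both "hello" and "world" in the "body" field are scored.
BooleanQuery query = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("body", "hello")), BooleanClause.Occur.MUST)
    .add(new TermQuery(new Term("body", "world")), BooleanClause.Occur.MUST)
    .build();
```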

  2. You can write your own similarity by extending TFIDFSimilarity, but as you may have noticed, the Lucene practical scoring is based on cosine similarity. In that formula, queryNorm(q) and norm(t, d) form the denominator of the cosine similarity, and the summation is the dot product of the query vector and the document vector.
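For reference, the "pure" cosine the question asks about can be computed directly from two sparse term-weight maps (e.g. TF-IDF weights you extract yourself from term vectors). This is a plain-Java sketch, independent of Lucene's scoring classes:

```java
import java.util.*;

// Cosine similarity between two sparse term-weight vectors:
// dot(a, b) / (|a| * |b|), iterating only over the terms present.
public class Cosine {
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Identical vectors score 1.0, and vectors sharing no term score 0.0, which matches the 0-to-1 range asked for in the question.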

Hint: you can form a sample query and use explain() to see the details of the scoring.
