
Lucene: Search and retrieve based on relevance

I am using Lucene for indexing and searching. Below is the code I use for searching. In the current code the results come back sorted, but I want them ordered by relevance. Suppose I search for a word like "abc": I want the results that match "abc" first, then those that match "ab" or "bc", and finally those that match "a", "b", or "c".

Can someone suggest how to retrieve results ordered by relevance when searching on multiple words? Thanks for your help.

By default, Lucene sorts by text relevance only. Quite a few factors contribute to the relevance score.

It is possible that tf-idf values and length normalization affected your scores, so that the "ab" / "bc" documents rank above the documents containing "abc".
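
To see how this can happen, here is a minimal sketch in plain Java (no Lucene dependency) of a simplified classic tf-idf score. It assumes the textbook factors tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)) and lengthNorm = 1 / sqrt(number of terms in the field), and it ignores query normalization, boosts, and the lossy norm encoding. The class name, method names, and numbers are made up for illustration.

```java
// Simplified model of a classic tf-idf score; illustrative only, not Lucene code.
final class ClassicScoreSketch {
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // termFreqs[i]: occurrences of query term i in the document (0 = no match).
    // docFreqs[i]:  number of documents in the index that contain query term i.
    static double score(int[] termFreqs, int[] docFreqs, int numDocs, int docLength) {
        int overlap = 0;
        double sum = 0.0;
        for (int i = 0; i < termFreqs.length; i++) {
            if (termFreqs[i] > 0) {
                overlap++;
                double weight = idf(docFreqs[i], numDocs);
                sum += tf(termFreqs[i]) * weight * weight * lengthNorm(docLength);
            }
        }
        // Default coord factor: fraction of query terms found in the document.
        double coord = (double) overlap / termFreqs.length;
        return coord * sum;
    }

    public static void main(String[] args) {
        int[] docFreqs = {100, 100, 100};
        // A 400-term document matching all three query terms once each.
        double longDoc = score(new int[] {1, 1, 1}, docFreqs, 1000, 400);
        // A 2-term document matching only the first query term.
        double shortDoc = score(new int[] {1, 0, 0}, docFreqs, 1000, 2);
        System.out.println("long doc, 3 matches: " + longDoc);
        System.out.println("short doc, 1 match:  " + shortDoc);
    }
}
```

Under these assumptions the very short document with a single match scores higher than the long document with all three matches, because the length normalization outweighs the default coord factor.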

One way to overcome this is to boost the relevance score based on the number of matching query terms. You may follow the steps below.

1) Write a customized Similarity class extending DefaultSimilarity. If you are wondering what Similarity is, it is the class Lucene uses to hold the formulas for all the scoring factors that contribute to the score.

Tutorial: Lucene Scoring

2) Override DefaultSimilarity.coord()

The coord() explanation in the Lucene documentation:

coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time. 

3) The default implementation of coord is overlap/maxoverlap. You may experiment with different formulas so that the documents containing more query words show up in the top results. The following formulas might be good starting points.

   1) coord return value = (float) Math.sqrt((double) overlap / maxoverlap)
   2) coord return value = overlap

4) You do NOT have to override other methods, since DefaultSimilarity has default implementations for all scoring factors. Just touch the one you want to experiment with, which is coord() in your case. If you extend Similarity directly, you have to provide all the implementations.

5) The Similarity can be passed to the IndexSearcher using IndexSearcher.setSimilarity().
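
The coord() formulas from step 3 can be compared side by side before wiring anything into Lucene. The following is a small sketch in plain Java; the class and method names are hypothetical, and only the arithmetic is taken from the answer. The casts matter because overlap and maxOverlap are integers, so overlap / maxOverlap without a cast would truncate to 0 or 1.

```java
// Candidate coord formulas, written as standalone methods for comparison.
// Illustrative only; these are not part of the Lucene API.
final class CoordFormulas {
    // Default behaviour described in the answer: fraction of query terms matched.
    static float defaultCoord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }

    // Formula 1: square root of the matched fraction.
    static float sqrtCoord(int overlap, int maxOverlap) {
        return (float) Math.sqrt((double) overlap / maxOverlap);
    }

    // Formula 2: raw number of matched query terms.
    static float linearCoord(int overlap, int maxOverlap) {
        return overlap;
    }

    public static void main(String[] args) {
        int maxOverlap = 3; // a three-term query
        for (int overlap = 1; overlap <= maxOverlap; overlap++) {
            System.out.println("overlap=" + overlap
                    + " default=" + defaultCoord(overlap, maxOverlap)
                    + " sqrt=" + sqrtCoord(overlap, maxOverlap)
                    + " linear=" + linearCoord(overlap, maxOverlap));
        }
    }
}
```

In a DefaultSimilarity subclass, one of these bodies would become the return value of the overridden coord(int overlap, int maxOverlap) method. For a flat three-term query, the full-match to single-match ratio is 3 for the default and for formula 2 (they differ only by the constant maxOverlap), and about 1.73 for the square root, so it is worth checking each candidate against your own data before settling on one.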
