Hibernate Search with Autocomplete and Fuzzy-Functionality

I am trying to create a Hibernate Search equivalent of the StringUtils containsIgnoreCase() method, combined with fuzzy-search matching.

Assume the user types the letter "p": they should get all matches that include the letter "p", regardless of whether the letter is at the beginning, middle, or end of the match.

As they go on to form words such as "Peter", they should also receive fuzzy matches such as "Petar", "Petaer" and "Peder".
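To make the desired behaviour concrete, here is a minimal plain-Java sketch of the matching rule described above: a case-insensitive "contains" check, plus a fuzzy check based on Levenshtein edit distance. It is only an illustration of the requirement, not Hibernate Search code. The class and method names are hypothetical, and the distance is computed on the whole string, whereas Lucene's fuzzy queries work on individual indexed terms.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Plain-Java illustration of the desired matching rule (hypothetical names). */
class DesiredMatching {

    /** Levenshtein distance: minimum number of insertions, deletions and substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** True if name contains input (ignoring case) or is within maxEdits edits of it. */
    static boolean matches(String name, String input, int maxEdits) {
        String n = name.toLowerCase(Locale.ROOT);
        String q = input.toLowerCase(Locale.ROOT);
        return n.contains(q) || editDistance(n, q) <= maxEdits;
    }

    /** Keeps only the names that match the input, preserving their order. */
    static List<String> filter(List<String> names, String input, int maxEdits) {
        return names.stream()
                .filter(n -> matches(n, input, maxEdits))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Peter", "Petar", "Petaer", "Peder", "Paul", "Anna");
        System.out.println(filter(names, "p", 1));     // every name containing "p"
        System.out.println(filter(names, "peter", 1)); // "Peter" plus names one edit away
    }
}
```

With a maximum of one edit, "Petar", "Petaer" and "Peder" all count as fuzzy matches for "peter", which is the behaviour the question asks the search index to reproduce.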

I am using the custom query and index analyzers provided in the great answer here, because I need a minGramSize of 1 to allow for the autocomplete functionality, while at the same time I also expect multi-word user input separated by whitespace, such as "EUR Account of Peter", which can be in any case (lower or upper).

So a user should be able to type "AND" and receive the above example as a match.

Currently, I am using the following query:

  org.apache.lucene.search.Query fuzzySearchByName = qb.keyword().fuzzy()
                                                   .withEditDistanceUpTo(1).onField("name")
                                                   .matching(userInput).createQuery();
  booleanQuery.add(fuzzySearchByName, BooleanClause.Occur.MUST);

However, exact matches do not receive precedence in the search results:

If we type "petar", we get the following results:

  1. Petarr (non-exact match)
  2. Petaer (non-exact match)
  ...
  4. PETAR (exact match)

The same applies to the user input "peter", where the first result is "Petero" and the second is "Peter" (the second should be the first).

I also need multi-word queries to return only exact matches. For example, if I start typing "Account for...", I want all matched results to include the phrase "Account for", and eventually its fuzzy-related terms based on that phrase (basically the same as the containsIgnoreCase() method mentioned earlier, with fuzzy support added).

I guess, however, that this contradicts the minGramSize of 1 and the WhitespaceTokenizerFactory?

"However, exact match cases do not receive precedence in the search results:"

Just use two queries instead of one:

EDIT: you will also need to set up two separate fields for autocomplete and "exact" match; see my edit at the bottom.

  org.apache.lucene.search.Query exactSearchByName = qb.keyword().onField("name")
                                                   .matching(userInput).createQuery();
  org.apache.lucene.search.Query fuzzySearchByName = qb.keyword().fuzzy()
                                                   .withEditDistanceUpTo(1).onField("name")
                                                   .matching(userInput).createQuery();
  // "boolean" is a reserved word in Java; the Hibernate Search DSL method is bool()
  org.apache.lucene.search.Query searchByName = qb.bool().should(exactSearchByName).should(fuzzySearchByName).createQuery();
  booleanQuery.add(searchByName, BooleanClause.Occur.MUST);

This will match documents that contain the user input exactly or approximately, so it will match the same documents as your example. However, documents that contain the user input exactly will match both queries, while documents that only contain something similar will match only the fuzzy query. As a result, exact matches will get a higher score and end up higher in the result list.
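The ranking effect can be illustrated with a toy model in plain Java. This is only a sketch: it assumes each matching "should" clause simply adds a fixed amount to the document score, whereas real Lucene scoring also depends on term frequencies, field lengths and the configured similarity. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Toy model of two "should" clauses whose scores add up (not real Lucene scoring). */
class ShouldClauseRanking {

    /** Levenshtein distance between two strings. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** Assumed scoring: the exact clause adds exactBoost, the fuzzy clause adds 1. */
    static float score(String name, String input, float exactBoost) {
        String n = name.toLowerCase(Locale.ROOT);
        String q = input.toLowerCase(Locale.ROOT);
        float score = 0f;
        if (n.equals(q)) {
            score += exactBoost; // exact clause matches
        }
        if (editDistance(n, q) <= 1) {
            score += 1f;         // fuzzy clause matches (distance 0 counts too)
        }
        return score;
    }

    /** Returns the matching names, highest score first; ties keep their input order. */
    static List<String> rank(List<String> names, String input, float exactBoost) {
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            if (score(name, input, exactBoost) > 0f) {
                hits.add(name);
            }
        }
        hits.sort(Comparator.comparingDouble((String n) -> score(n, input, exactBoost)).reversed());
        return hits;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Petarr", "Petaer", "PETAR", "Paul");
        System.out.println(rank(names, "petar", 1f));
    }
}
```

In this model "PETAR" satisfies both clauses while "Petarr" and "Petaer" satisfy only the fuzzy one, so the exact match sorts first. Raising exactBoost widens that gap.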

If exact matches still do not rank high enough, try adding a boost to the exactSearchByName query:

  // In the Hibernate Search 5 DSL, boostedTo() is set on the field, before matching()
  org.apache.lucene.search.Query exactSearchByName = qb.keyword().onField("name")
                                                   .boostedTo(4.0f)
                                                   .matching(userInput)
                                                   .createQuery();

"I guess however that this contradicts the minGramSize of 1 and the WhitespaceTokenizerFactory?"

If you want to match documents that contain any word (but not necessarily all words) appearing in the user input, and to put documents containing more of the words higher in the result list, do what I explained above.

If you want to match documents that contain all words in the exact same order, use a KeywordTokenizerFactory (i.e. no tokenizing).

If you want to match documents that contain all words in any order, well... that's less obvious. There's no support for that in Hibernate Search (yet), so you will essentially have to build the query yourself. One hack that I've already seen is something like this:

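// QueryParser here is org.apache.lucene.queryparser.classic.QueryParser
Analyzer analyzer = fullTextSession.getSearchFactory().getAnalyzer( "myAnalyzer" );

QueryParser queryParser = new QueryParser( "name", analyzer );
queryParser.setDefaultOperator( QueryParser.Operator.AND ); // Match *all* terms
Query luceneQuery = queryParser.parse( userInput ); // throws ParseException on invalid input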

... but that will not generate fuzzy queries. If you want fuzzy queries, you can try to override some methods in a custom subclass of QueryParser. I didn't try this, but it might work:

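import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

public final class FuzzyQueryParser extends QueryParser {
    private final int maxEditDistance;
    private final int prefixLength;

    // The constructor name must match the class name
    public FuzzyQueryParser(String fieldName, Analyzer analyzer, int maxEditDistance, int prefixLength) {
        super( fieldName, analyzer );
        this.maxEditDistance = maxEditDistance;
        this.prefixLength = prefixLength;
    }

    // Replace every plain term query with a fuzzy query on the same term.
    // The signature of this hook may differ in other Lucene versions.
    @Override
    protected Query newTermQuery(Term term) {
        return new FuzzyQuery( term, maxEditDistance, prefixLength );
    }
}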

EDIT: With a minGramSize of 1, you will get lots of very frequent terms: one- or two-character terms extracted from the beginning of words. This is likely to cause many unwanted matches that will be scored high (because the terms are frequent) and will probably drown out exact matches.
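To see where those short terms come from, here is a rough plain-Java sketch of what an edge n-gram analyzer produces. It assumes whitespace tokenization followed by lowercasing, and a maxGramSize of 5 chosen purely for illustration; the actual analyzer definition from the linked answer may differ. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Rough imitation of whitespace tokenizing + lowercasing + edge n-grams. */
class EdgeNGramSketch {

    /** Returns the prefixes of each word with lengths from minGram to maxGram. */
    static List<String> edgeNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank input yields a single empty token
            }
            int upper = Math.min(maxGram, word.length());
            for (int len = minGram; len <= upper; len++) {
                grams.add(word.substring(0, len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // The maxGram value of 5 is an assumption made for this example
        System.out.println(edgeNGrams("EUR Account of Peter", 1, 5));
        System.out.println(edgeNGrams("Paul", 1, 5));
    }
}
```

Both sample names produce the single-character term "p", and any other name starting with "p" would too. Terms that short are shared by a large part of the index, which is why they generate so many loosely related matches.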

First, you can try setting the similarity (~ scoring formula) to org.apache.lucene.search.similarities.BM25Similarity, which is better at ignoring very frequent terms. See here for the setting. That should improve scoring with the same analyzers.

Second, you can try setting up two fields instead of one: one field for fuzzy autocomplete and one for non-fuzzy, complete matches. That may improve the score of exact matches, since fewer meaningless terms will be indexed for the field used for exact matches. Just do this:

@Field(name = "name", analyzer = @Analyzer(definition = "text"))
@Field(name = "name_autocomplete", analyzer = @Analyzer(definition = "edgeNgram"))
private String name;

The analyzer "text" is just the analyzer "edgeNGram_query" from the answer you linked; just rename it.

Then proceed with writing two queries instead of one, as explained above, but make sure to target two different fields:

  org.apache.lucene.search.Query exactSearchByName = qb.keyword().onField("name")
                                                   .matching(userInput).createQuery();
  org.apache.lucene.search.Query fuzzySearchByName = qb.keyword().fuzzy()
                                                   .withEditDistanceUpTo(1).onField("name_autocomplete")
                                                   .matching(userInput).createQuery();
  // "boolean" is a reserved word in Java; the Hibernate Search DSL method is bool()
  org.apache.lucene.search.Query searchByName = qb.bool().should(exactSearchByName).should(fuzzySearchByName).createQuery();
  booleanQuery.add(searchByName, BooleanClause.Occur.MUST);

Don't forget to reindex after those changes, of course.
