Hibernate search fuzzy more than 2

I have a Java backend with Hibernate, Lucene and Hibernate Search. Now I want to do a fuzzy query, BUT instead of 0, 1, or 2, I want to allow more "differences" between the query and the expected result (to compensate, for example, for misspellings in long words). Is there any way to achieve this? The maximum number of allowed differences will later be calculated from the length of the query.
What I want this for is an autocomplete search with correction of wrong letters. This autocomplete should only allow for missing characters BEHIND the given query, not in front of it. If the indexed entry has additional characters in front of the query, those should be counted as differences.

Examples: the maximum number of allowed differences in these examples is 2. fooo should match

fooo       (no difference)
fooobar    (only characters added -> autocomplete)
fouubar    (characters added and misspelled -> autocomplete and spelling correction)

fooo should NOT match

barfooo    (we only allow additional characters behind the query, but this example is less important)
fuuu       (more than 2 differences)

This is my current code for the search query:

FullTextEntityManager fullTextEntityManager = this.sqlService.getFullTextEntityManager();
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder()
        .forEntity(MY_CLASS.class).overridesForField("name", "foo").get();
Query query = queryBuilder.keyword().fuzzy().withEditDistanceUpTo(2)
        .onField("name").matching("QUERY_TO_MATCH").createQuery();
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query, MY_CLASS.class);
List<MY_CLASS> results = fullTextQuery.getResultList();

Notes:
1. I use org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory for indexing, but that should not make any difference.
2. This is using a custom framework, which is not open source. You can just ignore the sqlService; it only provides the FullTextEntityManager and handles everything around Hibernate that does not require custom code each time.
3. This code does already work, but only with withEditDistanceUpTo(2), which means a maximum of 2 "differences" between QUERY_TO_MATCH and the matching entry in the database or index. Missing characters also count as differences.
4. withEditDistanceUpTo(...) does not accept values greater than 2.

Does anyone have any ideas on how to achieve this?

I am not aware of any solution where you would specify an exact number of changes that are allowed.

That approach has serious drawbacks, anyway: what does it mean to match "foo" with up to 3 changes? Just match anything? As you can see, a solution that works with varying term lengths might be better.

One solution is to index n-grams. I'm not talking about edge-ngrams, like you already use, but actual ngrams extracted from the whole term, not just its edges. So when indexing 2-grams of foooo, you would index:

  • fo
  • oo (occurring multiple times)

And when querying, the term fouuu would be transformed to:

  • fo
  • ou
  • uu

... and it would match the indexed document, since they have at least one term in common (fo).
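
To make the overlap concrete, here is a minimal plain-Java sketch of the idea (no Lucene involved; the bigrams helper is a hypothetical name used only for this illustration):

import java.util.LinkedHashSet;
import java.util.Set;

public class BigramDemo {

    // Collect every overlapping 2-gram of a term, in order of first occurrence.
    static Set<String> bigrams(String term) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 2 <= term.length(); i++) {
            grams.add(term.substring(i, i + 2));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("foooo")); // [fo, oo]
        System.out.println(bigrams("fouuu")); // [fo, ou, uu]
        // The two sets share "fo", which is why the query term matches the indexed one.
    }
}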

Obviously there are some drawbacks. With 2-grams, the term fuuuu wouldn't match foooo, but the term barfooo would, because they have a 2-gram in common. So you would get false positives. The longer the grams, the less likely you are to get false positives, but the less fuzzy your search will be.

You can make these false positives go away by relying on scoring and on a sort by score to place the best matches first in the result list. For example, you could configure the ngram filter to preserve the original term, so that fooo is transformed to [fooo, fo, oo] instead of just [fo, oo]; an exact search for fooo will then score higher for a document containing fooo than for a document containing barfooo (since there are more matches). You could also set up multiple separate fields: one without ngrams, one with 3-grams, one with 2-grams, and build a boolean query with one should clause per field: the more clauses are matched, the higher the score will be, and the higher you will find the document in the hits.
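
As a sketch of that multi-field variant with the Hibernate Search query DSL (reusing queryBuilder and fullTextEntityManager from the question's snippet; the field names name_3gram and name_2gram are hypothetical and assume the same property was indexed two more times with 3-gram and 2-gram analyzers):

String term = "QUERY_TO_MATCH";
Query exact = queryBuilder.keyword().onField("name").matching(term).createQuery();
Query trigrams = queryBuilder.keyword().onField("name_3gram").matching(term).createQuery();
Query bigrams = queryBuilder.keyword().onField("name_2gram").matching(term).createQuery();

// Each matching should-clause adds to the score; results are sorted by score by default.
Query combined = queryBuilder.bool()
        .should(exact)
        .should(trigrams)
        .should(bigrams)
        .createQuery();
List<MY_CLASS> results = fullTextEntityManager
        .createFullTextQuery(combined, MY_CLASS.class)
        .getResultList();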

Also, I'd argue that fooo and similar are really artificial examples and you're unlikely to have these terms in a real-world dataset; you should try whatever solution you come up with against a real dataset and see if it works well enough. If you want fuzzy search, you'll have to accept some false positives: the question is not whether they exist, but whether they are rare enough that users can still easily find what they are looking for.

In order to use ngrams, apply the n-gram filter using org.apache.lucene.analysis.ngram.NGramFilterFactory. Apply it both when indexing and when querying. Use the parameters minGramSize/maxGramSize to configure the size of the ngrams, and keepShortTerm (true/false) to control whether to preserve the original term or not.
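
For example, a minimal sketch of an analyzer definition using Hibernate Search 5's annotation API (the analyzer name ngram_2 is arbitrary; keepShortTerm is the parameter named above, but depending on your Lucene version the factory may expose preserveOriginal instead, so check the factory you actually ship):

import javax.persistence.Entity;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.ngram.NGramFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.*;

@Entity
@Indexed
@AnalyzerDef(name = "ngram_2",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                @TokenFilterDef(factory = NGramFilterFactory.class, params = {
                        @Parameter(name = "minGramSize", value = "2"),
                        @Parameter(name = "maxGramSize", value = "2"),
                        // Parameter named in this answer; some Lucene versions
                        // call the equivalent option "preserveOriginal".
                        @Parameter(name = "keepShortTerm", value = "true")
                })
        })
public class MY_CLASS {

    // The same analyzer is picked up automatically when the query DSL
    // analyzes the search terms for this field.
    @Field(analyzer = @Analyzer(definition = "ngram_2"))
    private String name;
}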

You may keep the edge-ngram filter or not; try it and see whether it improves the relevance of your results. I suspect it may improve the relevance slightly if you use keepShortTerm = true. In any case, make sure to apply the edge-ngram filter before the ngram filter.

OK, my friend and I found a solution. We found an issue in the Lucene changelog asking for the same feature, and we implemented a solution: there is a SlowFuzzyQuery in Lucene's sandbox module. It is slower (obviously) but supports an edit distance greater than 2.
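
For reference, a minimal sketch of what that looks like (assuming the lucene-sandbox artifact matching your Lucene version is on the classpath, and reusing fullTextEntityManager from the question's snippet; in this API, a similarity value >= 1 is interpreted as a raw Levenshtein edit distance):

import org.apache.lucene.index.Term;
import org.apache.lucene.sandbox.queries.SlowFuzzyQuery;
import org.apache.lucene.search.Query;

// Allow up to 3 edits between the query term and the indexed terms.
// Note: building the Lucene Query by hand bypasses Hibernate Search's
// analysis, so pass an already-analyzed (e.g. lowercased) term here.
Query query = new SlowFuzzyQuery(new Term("name", "query_to_match"), 3f);
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query, MY_CLASS.class);
List<MY_CLASS> results = fullTextQuery.getResultList();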
