
Do stemming and fuzzy search work together in Apache Solr?

I am using the Porter stem filter factory on a field that holds 3 to 4 words.

E.g.: "ABC BLOSSOM COMPANY"

I expect to fetch the above document when I search for ABC BLOSSOMING COMPANY as well.

When I run this query:

name:ABC AND name:BLOSSOMING AND name:COMPANY

I get my result.

This is what the parsed query looks like:

+name:southern +name:blossom +name:compani (the stemmer works fine)

But when I add the fuzzy syntax and query like this:

name:ABC~1 AND name:BLOSSOMING~1 AND name:COMPANY~1

the search returns no documents, and the parsed query looks like this:

+name:abc~1 +name:blossoming~1 +name:company~2

This clearly shows that stemming is not happening. Kindly review and give feedback.

TL;DR
Stemming is not happening because you are using the PorterFilter, which is not a MultiTermAwareComponent. A fuzzy query is a multiterm query, so only MultiTermAware components are applied to its terms.

What To Do?
Use one of the Filters/Normalizers that implements the MultiTermAwareComponent interface.

Explanation
You, like many others, have been caught by Solr's and Lucene's multiterm behaviour. There is a good article about this topic on the Solr wiki. Although the article is dated, it still holds true:

One of the surprises for most Solr users is that wildcards queries haven't gone through any analysis. Practically, this means that wildcard (and prefix and range) queries are case sensitive, which is at odds with expectations. As of this SOLR-2438, SOLR-2918, and perhaps SOLR-2921, this behavior is changed.

What's a multiterm you ask? Essentially it's any term that may "point to" more than one real term. For instance, run* could expand to runs, runner, running, runt, etc. Likewise, a range query is really a "multiterm" query as well. Before Solr 3.6, these were completely unprocessed, the application layer usually had to apply any transformations required, for instance lower-casing the input. Running these types of terms through a "normal" query analysis chain leads to all sorts of interesting behavior so was avoided.
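To make the effect on the query from the question concrete, here is a minimal Python sketch. It assumes the index holds the stemmed terms shown in the non-fuzzy parsed query (blossom, compani) and that the fuzzy terms are only lowercased, not stemmed. It uses plain Levenshtein distance as a stand-in for Lucene's fuzzy matching, which by default also counts a transposition as a single edit; the helper names `edit_distance` and `fuzzy_hits` are made up for this illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Lowercased (but unstemmed) fuzzy query term -> term assumed to be in the index
# after lowercasing and Porter stemming of "ABC BLOSSOM COMPANY".
indexed = {"abc": "abc", "blossoming": "blossom", "company": "compani"}


def fuzzy_hits(max_edits: int) -> dict:
    """For each query term, report whether the indexed term is within max_edits."""
    return {q: edit_distance(q, t) <= max_edits for q, t in indexed.items()}


if __name__ == "__main__":
    for query_term, indexed_term in indexed.items():
        print(query_term, indexed_term, edit_distance(query_term, indexed_term))
    print(fuzzy_hits(1))
```

Under these assumptions "blossoming" is three edits away from the indexed "blossom", so neither ~1 nor ~2 can reach it, and because the clauses are joined with AND the whole query returns nothing.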

Well, here's the configuration that somewhat did it for me while experimenting:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>        
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

(Yes, I modified the existing "text_general" field; as I said, I was experimenting.)

Using it with a fuzzy edit distance of 2, it produced the following results for the term "neglect":

1. Lost in Translation - A faded movie star and a neglected young woman...
2. Election - A high school teacher meets his match in an over-achieving...
3. Annie Hall - Alvy Singer, a divorced Jewish comedian, reflects on his relationship...

This is somewhat good, because the first result is appropriate.

Yet if I search for "rescuing" with fuzzy search enabled, it produces nothing. With fuzzy disabled, the results are:

1. The Searchers - ... a years-long journey to rescue his niece from ...
2. Star Wars - ...while also attempting to rescue Princess Leia from...

So the results of fuzzy + stemming are fairly inconsistent. Elasticsearch, which is Lucene-based like Solr, doesn't recommend using fuzzy with stemming:

This also means that if using say, a snowball analyzer, a fuzzy search for 'running', will be stemmed to 'run', but will not match the misspelled word 'runninga', which stems to 'runninga', because 'run' is more than 2 edits away from 'runninga'. This can cause quite a bit of confusion, and for this reason, it often makes sense only to use the simple analyzer on text intended for use with fuzzy queries, possibly disabling synonyms as well.

Source: https://www.elastic.co/blog/found-fuzzy-search
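One plausible reading of the results above can be checked with a few edit distances. The sketch below is only an illustration: the stems "elect" (from "Election"), "reflect" (from "reflects") and "rescu" (from "rescue") are my assumptions about what the Porter stemmer puts in the index, not something taken from the actual index, and plain Levenshtein distance stands in for Lucene's fuzzy matching. The names `edit_distance`, `cases` and `report` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


MAX_EDITS = 2  # the fuzzy edit distance used in the experiment

# (fuzzy query term, term assumed to be in the index)
cases = [
    ("neglect", "neglect"),   # assumed stem of "neglected" (Lost in Translation)
    ("neglect", "elect"),     # assumed stem of "Election"
    ("neglect", "reflect"),   # assumed stem of "reflects" (Annie Hall)
    ("rescuing", "rescu"),    # assumed stem of "rescue"; query term left unstemmed
    ("run", "runninga"),      # the pair from the Elasticsearch quote
]


def report(pairs, max_edits=MAX_EDITS):
    """Return (query term, indexed term, distance, within max_edits) per pair."""
    rows = []
    for query_term, indexed_term in pairs:
        d = edit_distance(query_term, indexed_term)
        rows.append((query_term, indexed_term, d, d <= max_edits))
    return rows


if __name__ == "__main__":
    for row in report(cases):
        print(row)
```

If those stems are right, "neglect" is within 2 edits of "neglect", "elect" and "reflect", which would account for all three hits, while the unstemmed "rescuing" is 3 edits from "rescu" and "run" is 5 edits from "runninga", so neither can match.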
