
How to configure Solr to do partial word matching

Given the following set of values, how do I configure the field to return values that are partial word matches but that also match the entire search term?

Values:

Texas State University
Stanford University
St. Johns College

Desired results examples:

Search Term: sta

Desired Results:

Texas State University
Stanford University

Search Term: stan

Desired Results:

Stanford University

Search Term: st un

Desired Results:

Texas State University
Stanford University

This is what I've tried so far:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
</fieldType>

I think my problem is with the EdgeNGramFilterFactory. With the configuration above, the second search (stan) returns all three of the values shown instead of only Stanford University. But without the EdgeNGramFilterFactory, partial words don't match at all.

What is the correct configuration for a Solr field to return values that are partial word matches but that also match the entire search term?

I think I figured it out. I definitely welcome other answers and additional corrections though.

The solution appears to be to use the EdgeNGramFilterFactory only when indexing, not when querying. If the filter also runs at query time, the search term stan is itself broken into st, sta, and stan, and the st gram matches every value. I want n-grams when indexing but only want to match the actual search term when querying.

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
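To see why this works, here is a rough Python sketch of the matching (it is not Solr itself). It assumes a simplified analysis chain: lowercasing and splitting on non-alphanumeric characters, with no stop words or stemming. It also assumes every query word has to match, similar to running with q.op=AND. All function names are made up for illustration.

```python
import re

VALUES = ["Texas State University", "Stanford University", "St. Johns College"]


def tokenize(text):
    # Simplified stand-in for tokenizer + lowercase filter (assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def edge_ngrams(token, min_size=2, max_size=15):
    # Front edge n-grams: "state" -> "st", "sta", "stat", "state".
    return [token[:n] for n in range(min_size, min(len(token), max_size) + 1)]


def index_terms(value):
    # Index-time analysis: every word is expanded into its edge n-grams.
    terms = set()
    for token in tokenize(value):
        terms.update(edge_ngrams(token))
    return terms


def search(query, values, ngram_query=False):
    # ngram_query=True mimics the original single-analyzer config, where
    # the query word is n-grammed too and any of its grams may match.
    results = []
    for value in values:
        terms = index_terms(value)
        matched_all = True
        for token in tokenize(query):
            candidates = edge_ngrams(token) if ngram_query else [token]
            if not any(c in terms for c in candidates):
                matched_all = False
                break
        if matched_all:
            results.append(value)
    return results


if __name__ == "__main__":
    print(search("stan", VALUES, ngram_query=True))  # filter on both sides
    print(search("stan", VALUES))                    # filter at index time only
    print(search("st un", VALUES))
```

With ngram_query=True the query stan also carries the gram st, which every value contains, so all three values come back. With the filter applied only at index time, stan has to match as a whole term and only Stanford University qualifies.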

I had a similar requirement and tried this: I created a different field type.

<fieldType name="text_reference" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="50" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
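Note that KeywordTokenizerFactory treats the whole field value as a single token, so this type only matches from the start of the value, and minGramSize="4" means a query shorter than four characters never matches. Here is a rough Python sketch of that behavior. It assumes the query text reaches the query analyzer as one string (depending on the query parser, a multi-word query may need to be quoted for that), and the function names are hypothetical.

```python
VALUES = ["Texas State University", "Stanford University", "St. Johns College"]


def keyword_edge_ngrams(value, min_size=4, max_size=50):
    # Keyword tokenizer + lowercase: the whole value is one token,
    # then front edge n-grams of that single token are indexed.
    token = value.lower()
    return {token[:n] for n in range(min_size, min(len(token), max_size) + 1)}


def prefix_search(query, values):
    # Query side: the lowercased query is compared as one whole term.
    term = query.lower()
    return [v for v in values if term in keyword_edge_ngrams(v)]


if __name__ == "__main__":
    print(prefix_search("stan", VALUES))        # prefix of a whole value
    print(prefix_search("sta", VALUES))         # shorter than minGramSize
    print(prefix_search("university", VALUES))  # not at the start of a value
```

So this type suits prefix or autocomplete style lookups on the whole value, but it will not find a word in the middle of a value the way the tokenized configuration does.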

I also had another requirement. The blog post below explains it in detail:

https://www.blogger.com/blogger.g?blogID=8592878860404675342#editor/target=post;postID=6309840933546641223;onPublishedMenu=allposts;onClosedMenu=allposts;postNum=33;src=postname

You can use the N-Gram Filter.

Generates n-gram tokens of sizes in the given range. Note that tokens are ordered by position and then by gram size.

Factory class: solr.NGramFilterFactory

Arguments:

minGramSize: (integer, default 1) The minimum gram size.
maxGramSize: (integer, default 2) The maximum gram size.

Example:

<analyzer>  
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.NGramFilterFactory"/>
</analyzer>

In: "four score"

Out: "f", "o", "u", "r", "fo", "ou", "ur", "s", "c", "o", "r", "e", "sc", "co", "or", "re"

http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.3.pdf#page=112&zoom=auto,-187,475
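For reference, the example output above can be reproduced with a few lines of Python. This is only a sketch: it splits on whitespace rather than using StandardTokenizerFactory, and it emits grams in the order of the listing (all size-1 grams of a word, then all size-2 grams), which may not be the order a given Solr version emits them in. The function names are made up for illustration.

```python
def ngrams(token, min_size=1, max_size=2):
    # All substrings of each size, taken from every start position.
    grams = []
    for size in range(min_size, max_size + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


def analyze(text, min_size=1, max_size=2):
    # Whitespace split stands in for the tokenizer (assumption).
    grams = []
    for token in text.lower().split():
        grams.extend(ngrams(token, min_size, max_size))
    return grams


if __name__ == "__main__":
    print(analyze("four score"))
```

Unlike the edge n-gram filter, these grams start at every position in a word, so a search term can also match in the middle of a word. Raising minGramSize above the default of 1 keeps single letters from matching everything.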
