
Solr stop words do not seem to work: they are removed at index time, but not at query time in a proximity search

I am using Solr 8.2.0. I am trying to configure proximity search in Solr, but it doesn't seem to remove the stop words from the query.

<fieldType name="psearch" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

I have listed the stop words in the stopwords.txt file in the directory. At index time Solr removes those words, as the indexed terms show.

I also checked it in the Analysis tab, and the stop words are removed there as well.

And here is the field:

<field name="pSearchField" type="psearch" indexed="true" stored="true" multiValued="false"/>
<copyField source="example" dest="pSearchField"/>

Searching with proximity

When I set the proximity to 1, 2, or 3, it returns no results.

This is a known problem with Solr 5 and up, since it no longer rewrites the position for each token when the stop filter is invoked: a removed stop word leaves a gap in the token positions. This issue, with a few suggestions for how to fix it, is tracked in SOLR-6468.
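To make the position gap concrete, here is a minimal sketch in plain Java that imitates how a stop filter carries position increments forward. It does not use Lucene; the class name `StopGapDemo`, the two-word stop list, and the `term+increment` output format are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopGapDemo {

    // Hypothetical stop word list, standing in for stopwords.txt.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("of", "the"));

    // Returns "term+increment" for every token that survives stop word removal.
    // Each removed stop word adds 1 to the increment of the next surviving token.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        int pending = 1;
        for (String term : text.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(term)) {
                pending++; // the stop word is dropped, but its position is kept as a gap
            } else {
                out.add(term + "+" + pending);
                pending = 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [lord+1, rings+3]
        System.out.println(analyze("Lord of the Rings"));
    }
}
```

In this model "rings" ends up three positions after "lord" rather than one, so the two surviving terms are not adjacent even though the stop words themselves are gone.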

The easiest solution is to introduce a mapping char filter factory, but I'm skeptical of it, since it changes characters inside a string (i.e. "to" => "" also affects veto and not just to). This can possibly be handled with multiple PatternReplaceCharFilterFactories instead.
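The difference between the two approaches can be sketched in plain Java, since PatternReplaceCharFilterFactory takes a Java regular expression. The class name `StopWordPatternDemo` and the `\bto\b` pattern are illustrative assumptions, not a tested Solr configuration; the sketch only shows why a word-boundary pattern avoids the "veto" problem.

```java
import java.util.regex.Pattern;

public class StopWordPatternDemo {

    // Plain substring replacement, roughly what a "to" => "" character mapping does.
    static String naive(String text) {
        return text.replace("to", "");
    }

    // \b anchors the match to word boundaries, so only the standalone word "to" matches.
    private static final Pattern WHOLE_WORD_TO = Pattern.compile("\\bto\\b");

    static String wholeWord(String text) {
        return WHOLE_WORD_TO.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(naive("veto"));     // prints ve
        System.out.println(wholeWord("veto")); // prints veto
    }
}
```

A pattern like this would have to be repeated or extended for every stop word, which is presumably why several filters (or one alternation pattern) would be needed.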

Another option shown in the thread for the ticket is to use a custom filter that rewrites the position data for each token:

package filters;

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.util.TokenFilterFactory;

/** Factory that makes the gap-removing filter available to an analyzer chain. */
public class RemoveTokenGapsFilterFactory extends TokenFilterFactory {

    public RemoveTokenGapsFilterFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new RemoveTokenGapsFilter(input);
    }

}

/** Sets every token's position increment to 1, closing the gaps left by removed stop words. */
final class RemoveTokenGapsFilter extends TokenFilter {

    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

    public RemoveTokenGapsFilter(TokenStream input) {
        super(input);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // Overwrite whatever increment the upstream filters left on this token.
            posIncrAtt.setPositionIncrement(1);
            return true;
        }
        // The upstream token stream is exhausted.
        return false;
    }
}
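As a rough illustration of what this filter does to token positions, the following plain-Java sketch converts position increments into absolute positions before and after every increment is forced to 1. It does not use Lucene; the class `GapResetDemo` and its helper methods are hypothetical names introduced only for this example.

```java
import java.util.Arrays;

public class GapResetDemo {

    // Turns position increments into absolute positions.
    // Counting starts at -1 so that a first token with increment 1 lands on position 0.
    static int[] positions(int[] increments) {
        int[] pos = new int[increments.length];
        int current = -1;
        for (int i = 0; i < increments.length; i++) {
            current += increments[i];
            pos[i] = current;
        }
        return pos;
    }

    // Mirrors the filter: every increment becomes 1, whatever it was before.
    static int[] resetGaps(int[] increments) {
        int[] out = new int[increments.length];
        Arrays.fill(out, 1);
        return out;
    }

    public static void main(String[] args) {
        int[] withGaps = {1, 3}; // two tokens with two stop words removed between them
        System.out.println(Arrays.toString(positions(withGaps)));            // prints [0, 2]
        System.out.println(Arrays.toString(positions(resetGaps(withGaps)))); // prints [0, 1]
    }
}
```

One consequence worth noting: because every increment is overwritten, a token that arrived with an increment of 0 (for example a synonym stacked on the same position) would also be moved to its own position, so the filter is probably best placed in chains that do not rely on stacked tokens.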

As far as I know, there is currently no perfect, built-in solution to this issue.
