Solr stop words do not seem to work: stop words are removed at index time, but at query time the stopwords are not removed in proximity search
I am using Solr 8.2.0. I am trying to configure proximity search in my Solr, but it doesn't seem to remove the stopwords from the query.
<fieldType name="psearch" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
I have listed the stopwords in the stopwords.txt file in the config directory, and at index time Solr removes them, as you can see in the picture: indexed terms
I also checked it in the Analysis tab, where the stopwords are being removed: Analysis tab
And here is the field:
<field name="pSearchField" type="psearch" indexed="true" stored="true" multiValued="false" />
<copyField source="example" dest="pSearchField"/>
And when I set the proximity to 1, 2, or 3, it returns no results: result
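For example, a proximity query of this form returns nothing (the terms here are illustrative, since the actual query is only shown in the screenshot):

```
q=pSearchField:"history world"~2
```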
This is a known problem with Solr 5 and up, since it no longer rewrites the position for each token when the stop filter is invoked. This issue, with a few suggestions of how to fix it, is tracked in SOLR-6468.
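To make the gap concrete, here is a small plain-Java sketch (no Lucene dependency; the class and method names are illustrative, not part of Solr) that mimics how the stop filter advances the position counter even for the tokens it removes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: mimics how a position-preserving stop filter
// assigns token positions, leaving gaps where stopwords were removed.
public class PositionGapDemo {

    // Returns the positions of the tokens that survive stopword removal.
    static List<Integer> positions(String text, Set<String> stopwords) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (String token : text.toLowerCase().split("\\s+")) {
            pos++;                        // every original token advances the position
            if (stopwords.contains(token)) {
                continue;                 // token removed, but the gap it leaves remains
            }
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the", "of");
        System.out.println(positions("history of the world", stop)); // prints [0, 3]
    }
}
```

With "of" and "the" removed, "history" sits at position 0 and "world" at position 3, so a phrase query `"history world"` needs a slop of at least 2 to match; with a slop of 1 it finds nothing, which is exactly the behaviour described in the question.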
The easiest solution is to introduce a MappingCharFilterFactory, but I'm skeptical of having it change characters inside a string (i.e. a mapping like "to" => "" would also affect "veto", not just the standalone word "to"). This can possibly be handled with multiple PatternReplaceCharFilterFactories instead.
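For the pattern-based variant, the char filters run before the tokenizer, so the stopwords never reach the token stream at all. A sketch (the pattern and the one-filter-per-stopword layout are illustrative):

```
<analyzer type="query">
  <!-- Illustrative: strip the stopword "to" as a whole word only,
       so "veto" is left untouched. Add one such filter per stopword. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bto\b" replacement=""/>
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```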
Another option shown in the thread for the ticket is to use a custom filter that rewrites the position data for each token:
package filters;

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class RemoveTokenGapsFilterFactory extends TokenFilterFactory {

    public RemoveTokenGapsFilterFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new RemoveTokenGapsFilter(input);
    }
}

final class RemoveTokenGapsFilter extends TokenFilter {

    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

    public RemoveTokenGapsFilter(TokenStream input) {
        super(input);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // Force every surviving token to sit directly after the previous
            // one, closing the gap left by removed stopwords.
            posIncrAtt.setPositionIncrement(1);
            return true;
        }
        return false;
    }
}
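If you go this route, the compiled filter has to be on Solr's classpath (for example, packaged as a jar and dropped into the core's lib directory), and the factory is then referenced after the stop filter in the analyzer chain. A sketch of what the query analyzer would look like:

```
<analyzer type="query">
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="filters.RemoveTokenGapsFilterFactory"/>
</analyzer>
```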
As far as I know, there is currently no perfect, built-in solution to this issue.