Using different language stop words in Solr

Solr provides field types out of the box in the managed schema for different languages, such as English, French, and Japanese.

We are using the common field type "text_general" for our field declarations, and stopwords.txt for stop word filtering.

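  <!-- Opening tag assumed: the question names the type "text_general"; analyzers require class solr.TextField. -->
  <fieldType name="text_general" class="solr.TextField">
    <analyzer type="index">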
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="1"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

While syncing data to the Solr core, we import text in different languages into these fields, such as French, English, and German.

My question: should we put the stop words for all of these languages into the same stopwords.txt file, or how does Solr handle stop words for different languages?

Do not remove stop words. Stop word removal is a disk-space-saving hack left over from 32-bit machines in the 1970s.

I've never removed stop words, and I started working in search 25 years ago at Infoseek (which did not remove stop words).

Removing them from the index makes some queries impossible, like "vitamin a". When I was building search at Netflix, I accidentally left stop word removal configured and found a whole set of movie titles that were 100% stop words. That list is in this blog post.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

The "idf" score in a tf.idf system like Solr does the same job as stop words, but better. It gives common words a lower score based on the statistics of this particular collection.
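To make the idf point concrete, here is a rough worked example. It assumes Solr's default BM25 similarity, whose idf term in Lucene takes the form below, where N is the number of documents with the field and n_t is the number containing term t. The collection size and document frequencies are hypothetical, chosen only for illustration.

```latex
% idf term as used by Lucene's BM25 similarity
\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)

% Hypothetical collection: N = 1{,}000{,}000 documents
% "the" appears in n_t = 950{,}000 documents
\mathrm{idf}(\text{the}) = \ln\!\left(1 + \frac{50{,}000.5}{950{,}000.5}\right) \approx 0.051

% "vitamin" appears in n_t = 1{,}000 documents
\mathrm{idf}(\text{vitamin}) = \ln\!\left(1 + \frac{999{,}000.5}{1{,}000.5}\right) \approx 6.91
```

Under these assumed numbers the common word contributes roughly a hundredth of the weight of the rare one, so it barely affects ranking while still being searchable.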

Do not remove stop words.
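For the field type in the question, that advice could look roughly like the sketch below: the same analyzer chains with the StopFilterFactory lines dropped. The question does not show the opening fieldType tag, so the name and class attributes here are assumptions; everything else is copied from the posted configuration.

```xml
<!-- Sketch only: analyzer chains from the question, without StopFilterFactory. -->
<!-- The name and class attributes are assumed; the question omits the opening tag. -->
<fieldType name="text_general" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With no stop filter in the chain, there is no per-language stop word file to maintain for French, English, or German text. A change to the index-time analyzer only applies to documents indexed afterwards, so existing data would need to be reindexed.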
