
Using different language stop words in Solr

Solr ships with field types for different languages (English, French, Japanese, etc.) out of the box in the managed schema.

We are using the common field type "text_general" for our field declarations and stopwords.txt for stop word filtering.

    <fieldType name="text_general" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="1"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

While syncing data to the Solr core, we import text in different languages into these fields, such as French, English, German, etc.

My question: should we put the stop words for all of these languages into the same "stopwords.txt" file, or does Solr handle stop words for different languages some other way?

Do not remove stop words. Stop word removal is a disk space saving hack left over from 32-bit machines in the 1970s.

I've never removed stop words and I started working in search 25 years ago at Infoseek (which did not remove stop words).

Removing them from the index makes some queries impossible, like "vitamin a". When I was building search at Netflix, I accidentally left the stop word removal configured and found a whole set of movie titles that were 100% stop words. That list is in this blog post.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

The "idf" score in a tf.idf system like Solr does the same job as stop words, but better. It gives common words a lower score based on the statistics of this particular collection.
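To make the idf point concrete, here is a rough sketch. It assumes the BM25 similarity, which is the default in recent Solr versions; other similarities use a different but similarly shaped formula. With N documents in the collection and n_t of them containing term t:

```latex
\mathrm{idf}(t) = \ln\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
```

With made-up numbers for illustration: in a collection of N = 1,000,000 documents, a word like "the" that appears in 900,000 of them gets an idf of about 0.105, while a term that appears in only 100 documents gets about 9.2. The common word still matches, but it contributes very little to the score.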

Do not remove stop words.
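Applied to the field type in the question, that would mean dropping the two StopFilterFactory lines and leaving the rest of the chain alone. A minimal sketch; the fieldType attributes (class and positionIncrementGap) are assumptions, since the opening tag is not shown in the question:

```xml
<!-- Sketch: the question's text_general chain without stop word removal. -->
<!-- class and positionIncrementGap are assumed; the original opening tag was not posted. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because this changes index-time analysis, existing documents would need to be reindexed before the change takes full effect.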
