Lucene stemmer strategy (does it keep both stemmed and non-stemmed words, or just the stemmed ones?)

I have a question regarding the Lucene stemmer. Does Lucene keep both the stemmed and the non-stemmed words, or does it just replace the non-stemmed words with their stemmed forms?

For example, if a record contains "everyone loves cats", is it going to be indexed as "everyone loves love cats cat" or as "everyone love cat"?

Does it use the same strategy for both queries and records?

Generally, only the stemmed version is kept. That is, in your example, the end result will be along the lines of "everyone love cat" rather than "everyone loves love cats cat" or some similar combination. The exact terms depend on the stemmer in use.

You are expected to use the same stemmer both when indexing and when querying. There may be some stemming filters that, like SynonymFilter, allow you to keep the original token, but doing this and running unstemmed queries will tend to cause PhraseQueries not to work correctly (see the note in the SynonymFilter docs on this very topic). I don't believe the most common stemming filters (e.g. PorterStemFilter) provide that functionality.
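To illustrate why both sides need the same analysis, here is a minimal sketch in plain Java that does not use Lucene at all. The `toyStem` and `analyze` helpers are hypothetical names, and `toyStem` is a deliberately crude suffix stripper standing in for a real stemmer such as PorterStemFilter, so its output will differ from what Lucene actually produces.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StemSketch {
    // Hypothetical toy stemmer: strips one trailing "s" from words longer
    // than three characters. This is NOT the Porter algorithm.
    static String toyStem(String token) {
        return (token.length() > 3 && token.endsWith("s"))
                ? token.substring(0, token.length() - 1)
                : token;
    }

    // Lowercase, split on whitespace, then REPLACE each token with its stem.
    // The original surface form is not kept.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(toyStem(token));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Index time: only the stemmed terms end up in the "index".
        Set<String> indexed = new HashSet<>(analyze("everyone loves cats"));

        // Query time, same analysis: "cats" becomes "cat" and matches.
        System.out.println(indexed.contains(analyze("cats").get(0))); // true

        // Query time, no stemming: the raw term "cats" was never indexed.
        System.out.println(indexed.contains("cats")); // false
    }
}
```

The second lookup fails because the index holds only the replaced terms, which is the mismatch you get when the query side skips the stemmer.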

If you need to be able to search unstemmed data for some reason, I would recommend storing a second field that is entirely unstemmed for that purpose.
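The two-field idea can be sketched the same way, again in plain Java rather than against the Lucene API. The field names `body` and `body_exact`, the `indexDocument` helper, and the crude `toyStem` stand-in are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TwoFieldSketch {
    // Hypothetical toy stemmer (strips one trailing "s"); a stand-in, not Porter.
    static String toyStem(String token) {
        return (token.length() > 3 && token.endsWith("s"))
                ? token.substring(0, token.length() - 1)
                : token;
    }

    // Index the same text twice: once stemmed, once exactly as written.
    // Returns a map from field name to the set of terms in that field.
    static Map<String, Set<String>> indexDocument(String text) {
        Set<String> stemmed = new HashSet<>();
        Set<String> exact = new HashSet<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            stemmed.add(toyStem(token));
            exact.add(token);
        }
        Map<String, Set<String>> fields = new HashMap<>();
        fields.put("body", stemmed);      // searched with stemmed queries
        fields.put("body_exact", exact);  // searched with raw queries
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc = indexDocument("everyone loves cats");

        // Stemmed field: query terms go through the same stemmer.
        System.out.println(doc.get("body").contains(toyStem("cat")));  // true

        // Exact field: only the literal surface forms match.
        System.out.println(doc.get("body_exact").contains("cat"));     // false
        System.out.println(doc.get("body_exact").contains("cats"));    // true
    }
}
```

In Lucene itself this would usually mean adding the text to two fields and giving each field its own analyzer, for instance through PerFieldAnalyzerWrapper, so that each field stays internally consistent between indexing and querying.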
