
Word stemming in R

I am working on a text mining project and trying to clean the text: singular and plural forms of words, verbs in different tenses, and misspelled words. My sample looks like this:

test <- c("apple","apples","wife","wives","win","won","winning","winner","orange","oranges","orenge")

I tried the wordStem function from the SnowballC package, but the results are not what I want:

"appl"   "appl"   "wife"   "wive"   "win"    "won"    "win"    "winner" "orang"  "orang"  "oreng" 

What I would like to see is:

"apple"   "apple"   "wife"   "wife"   "win"    "win"    "win"    "winner" "orange"  "orange"  "orange"

That is just how the Porter Stemmer works. It uses fairly simple suffix rules to create stems, so it does not have to store a large English vocabulary. The cost is that some stems are not real words.

For example, you probably do not like that "change" and "changing" both go to "chang". It seems more natural for both to stem to "change". So would you make a rule that whenever you take "ing" off the end of a word, you add back "e" to get the stem? Then what happens with "clang" and "clanging"? The Porter Stemmer gives "clang" for both. Adding "e" would give the non-word "clange".

Either you use simple processing rules that sometimes create stems that are not words, or you include a large vocabulary and more complex rules that depend on what the words are. The Porter Stemmer takes the simple-rules approach. Note also that no stemmer corrects misspellings such as "orenge" or maps irregular forms such as "won" to "win"; that requires a vocabulary.
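
To make the trade-off concrete, here is a minimal base R sketch. The function naive_ing_stem and the table lemma_table are hypothetical names invented for this illustration; the first is not the Porter algorithm, and the second is a tiny hand-written sample, not a real dictionary.

```r
# Hypothetical naive rule: replace a trailing "ing" with "e".
# This is NOT what the Porter Stemmer does; it only shows why such a
# rule cannot be applied blindly.
naive_ing_stem <- function(words) {
  sub("ing$", "e", words)
}

naive_ing_stem(c("change", "changing", "clang", "clanging"))
# "changing" becomes "change", but "clanging" becomes the non-word "clange"

# The vocabulary approach: look each word up in a table of known forms.
# This table is a made-up sample covering only the words in the question;
# a usable one would have to cover the whole vocabulary, misspellings included.
lemma_table <- c(
  apples  = "apple",
  wives   = "wife",
  won     = "win",
  winning = "win",
  oranges = "orange",
  orenge  = "orange"
)

lookup_lemma <- function(words, lemmas) {
  hits <- lemmas[words]                    # NA where a word is not in the table
  unname(ifelse(is.na(hits), words, hits)) # keep unknown words unchanged
}

test <- c("apple", "apples", "wife", "wives", "win", "won",
          "winning", "winner", "orange", "oranges", "orenge")
lookup_lemma(test, lemma_table)
```

With this sample table the lookup returns the output asked for in the question, but only because every irregular or misspelled form was entered by hand. That is the price of the vocabulary approach.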
