
Stemming function in R

The corpus package lets you supply a custom stemming function. Given a term as input, the stemming function should return the stem of that term as output.

I took the following example from Stemming Words; it uses the hunspell dictionary to do the stemming.

First I define the sentences on which to test this function:

sentences<-c("The color blue neutralizes orange yellow reflections.", 
             "Zod stabbed me with blue Kryptonite.", 
             "Because blue is your favourite colour.",
             "Red is wrong, blue is right.",
             "You and I are going to yellowstone.",
             "Van Gogh looked for some yellow at sunset.",
             "You ruined my beautiful green dress.",
             "You do not agree.",
             "There's nothing wrong with green.")

The custom stemming function is:

stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]

  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }

  stem
}
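
To make the fallback logic in stem_hunspell easier to follow, here is a minimal sketch that swaps the hunspell lookup for a tiny hard-coded table. The names toy_stem_lookup and stem_toy and the table entries are hypothetical and only for illustration; real hunspell results depend on the installed dictionary and may differ.

```r
# Hypothetical stand-in for hunspell::hunspell_stem(): it returns a list
# holding one character vector of candidate stems per input term.
toy_stem_lookup <- function(term) {
  dict <- list(
    reflections = "reflection",
    going       = c("going", "go")
  )
  lapply(term, function(w) {
    if (is.null(dict[[w]])) character(0) else dict[[w]]
  })
}

# Same control flow as stem_hunspell, but using the toy lookup.
stem_toy <- function(term) {
  stems <- toy_stem_lookup(term)[[1]]
  if (length(stems) == 0) {
    term                      # no stems found: keep the original term
  } else {
    stems[[length(stems)]]    # several stems: take the last one
  }
}

stem_toy("reflections")  # one candidate stem
stem_toy("going")        # two candidates, the last one wins
stem_toy("kryptonite")   # not in the table, returned unchanged
```

A term that is missing from the dictionary passes through untouched, which is why words such as "kryptonite" and "yellowstone" survive stemming unchanged.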

This code

library(corpus)
sentences <- text_tokens(sentences, stemmer = stem_hunspell)

produces:

> sentences
[[1]]
[1] "the"        "color"      "blue"       "neutralize" "orange"     "yellow"    
[7] "reflection" "."         

[[2]]
[1] "zod"        "stabbed"    "me"         "with"       "blue"       "kryptonite"
[7] "."         

[[3]]
[1] "because"   "blue"      "i"         "your"      "favourite" "colour"   
[7] "."        

[[4]]
[1] "re"    "i"     "wrong" ","     "blue"  "i"     "right" "."    

[[5]]
[1] "you"         "and"         "i"           "are"         "go"         
[6] "to"          "yellowstone" "."          

[[6]]
[1] "van"    "gogh"   "look"   "for"    "some"   "yellow" "at"     "sunset" "."     

[[7]]
[1] "you"       "ruin"      "my"        "beautiful" "green"     "dress"    
[7] "."        

[[8]]
[1] "you"   "do"    "not"   "agree" "."    

[[9]]
[1] "there"   "nothing" "wrong"   "with"    "green"   "." 

After stemming I would like to apply other operations to the text, e.g. removing stop words. However, when I applied the tm function:

removeWords(sentences,stopwords)

to my sentences, I obtained the following error:

Error in UseMethod("removeWords", x) : 
 no applicable method for 'removeWords' applied to an object of class "list"

If I use

unlist(sentences)

I don't get the desired result, because I end up with a single character vector of 65 elements. The desired result should be (e.g. for the first sentence):

"the color blue neutralize orange yellow reflection."       

If you want to remove stop words from each sentence, you can use lapply, which applies removeWords to each element of the list in turn:

library(tm)
lapply(sentences, removeWords, stopwords())

#[[1]]
#[1] ""           "color"      "blue"       "neutralize" "orange"     "yellow"     "reflection" "."         

#[[2]]
#[1] "zod"        "stabbed"    ""           ""           "blue"       "kryptonite" "."  
#...
#...
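
The empty strings in the output above are the positions where removeWords blanked out a stop word. A minimal sketch of dropping them follows; it assumes the two vectors copied from the output above as input, so that it runs without tm, and the variable names removed and kept are only illustrative.

```r
# Token vectors as they look after removeWords(), copied from the output above.
removed <- list(
  c("", "color", "blue", "neutralize", "orange", "yellow", "reflection", "."),
  c("zod", "stabbed", "", "", "blue", "kryptonite", ".")
)

# Keep only the non-empty tokens in each sentence.
kept <- lapply(removed, function(tok) tok[nzchar(tok)])

kept
```

Doing this before pasting the tokens back together avoids doubled spaces where a stop word used to be.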

However, from your expected output it looks like you want to paste the tokens of each sentence back together:

lapply(sentences, paste0, collapse = " ")

#[[1]]
#[1] "the color blue neutralize orange yellow reflection ."

#[[2]]
#[1] "zod stabbed me with blue kryptonite ."
#....
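
Note that the pasted sentences end in " ." while the desired output in the question ends in "." with no space. One possible way to close that gap, sketched here with base R only, is to delete whitespace that directly precedes a punctuation character. The input vectors are copied from the stemmed output in the question; the names joined and tidy are illustrative.

```r
# Two tokenized sentences, copied from the stemmed output in the question.
tokens <- list(
  c("the", "color", "blue", "neutralize", "orange", "yellow", "reflection", "."),
  c("re", "i", "wrong", ",", "blue", "i", "right", ".")
)

# Collapse each token vector into one string; vapply returns a character vector.
joined <- vapply(tokens, paste, character(1), collapse = " ")

# Remove any whitespace that sits immediately before a punctuation character.
tidy <- gsub("[[:space:]]+([[:punct:]])", "\\1", joined)

tidy
# [1] "the color blue neutralize orange yellow reflection."
# [2] "re i wrong, blue i right."
```

This is a rough heuristic: it would also pull an opening bracket or quote onto the preceding word, so it may need refining for text with richer punctuation.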

Alternatively, we can use map from purrr, which here works the same way as lapply:

library(tm)
library(purrr)
map(sentences, removeWords, stopwords())
#[[1]]
#[1] ""           "color"      "blue"       "neutralize" "orange"     "yellow"     "reflection"
#[8] "."         

#[[2]]
#[1] "zod"        "stabbed"    ""           ""           "blue"       "kryptonite" "."     
