
Removing Stop Words From Text in R

I have a problem removing stop words from text data. The data set was web-scraped, contains customer reviews, and looks like this:

data$Review <- c("Won't let me use my camera", "Does not load","I'ts truly mind blowing!")

I did the data manipulation below and stored the result in a new variable in the data frame, so the reviews now look like this:

Manipulation Part:
data$Proc_Review <- gsub("'", "", data$Review) # Remove apostrophes
data$Proc_Review <- gsub('[[:punct:] ]+', ' ', data$Proc_Review) # Replace runs of punctuation/spaces with a single space
data$Proc_Review <- gsub('[[:digit:]]+', '', data$Proc_Review) # Remove numbers
data$Proc_Review <- as.character(data$Proc_Review) # Make sure the column is character

Result:
"wont let me use my camera", "does not load", "its truly mind blowing"

The next step is to remove stop words, for which I use the code below:

data("stop_words")

j <- 1
for (j in 1:nrow(data)) {
  description <- anti_join(
    data[j, ] %>% unnest_tokens(word, Proc_Review, drop = FALSE, to_lower = FALSE),
    stop_words
  )
  data[j, "Proc_Review"] <- paste(description, collapse = " ")
}

After that the output is

c(1, 1) c(17304, 17304) c(\"Won't let me use my camera\", \"Won't let me use my camera\") c(1, 1) c(1, 1) c(32, 32) c(4, 4) c(\"wont let me use my camera\", \"wont let me use my camera\") c(\"wont\", \"camera\")"

I have tried some other approaches, but the result was not what I wanted: stop words were removed from some reviews but not from all of them. For example, "it's" was removed from some reviews but remained in others.

What I want is for the reviews to appear, without the stop words, in a new column of the data set. Thank you so much in advance!

There is no need to use a for loop. Additionally, there was a bug in your data processing: in steps 2 and 3 you use the original vector, so all the processing you did in the previous steps gets overwritten.
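As a side note on why the loop produced strings like c(1, 1) ... c(\"wont\", \"camera\"): paste() was handed the whole data frame returned by anti_join(), and it turns each column into one deparsed string. A minimal base R sketch of that effect, using a made-up two-row data frame (the names tokens, bad and good are only for illustration):

```r
# Hypothetical stand-in for the per-row result of anti_join() in the loop
tokens <- data.frame(
  id = c(1, 1),
  word = c("wont", "camera"),
  stringsAsFactors = FALSE
)

# Pasting the data frame itself deparses every column into a string
bad <- paste(tokens, collapse = " ")

# Pasting only the word column gives the text that was intended
good <- paste(tokens$word, collapse = " ")

print(bad)
print(good)
```

So inside the loop, pasting description$word rather than description would probably have given the expected text, although the loop is still unnecessary.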

library(tidytext)
library(dplyr)

data("stop_words")

df <- data.frame(
  Review = c("Won't let me use my camera", "Does not load","I'ts truly mind blowing!")
)

df$Proc_Review <- gsub("\\'", "", df$Review) # Remove apostrophes
df$Proc_Review <- gsub('[[:punct:] ]+', ' ', df$Proc_Review) # Replace runs of punctuation/spaces with a single space
df$Proc_Review <- gsub('[[:digit:]]+', '', df$Proc_Review) # Remove numbers
df$Proc_Review <- as.character(df$Proc_Review) # Make sure the column is character

df %>%
  unnest_tokens(word, Proc_Review, drop = FALSE, to_lower = FALSE)  %>%
  anti_join(stop_words)
#> Joining, by = "word"
#>                       Review               Proc_Review    word
#> 1 Won't let me use my camera Wont let me use my camera    Wont
#> 2 Won't let me use my camera Wont let me use my camera  camera
#> 3              Does not load             Does not load    Does
#> 4              Does not load             Does not load    load
#> 5   I'ts truly mind blowing!   Its truly mind blowing      Its
#> 6   I'ts truly mind blowing!   Its truly mind blowing     mind
#> 7   I'ts truly mind blowing!   Its truly mind blowing  blowing

Created on 2022-06-04 by the reprex package (v2.0.1)
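The output above has one row per remaining word, whereas the question asks for one cleaned string per review in a new column. One possible way to get there, sketched under the assumption that a row id is acceptable as a grouping key, is to paste the surviving words back together per review and join the result onto the data frame. The names id, no_stop and No_Stop_Review are not from the original post and are only illustrative:

```r
library(dplyr)
library(tidytext)

data("stop_words")

df <- data.frame(
  Review = c("Won't let me use my camera", "Does not load", "I'ts truly mind blowing!"),
  stringsAsFactors = FALSE
)

# Same cleaning steps as in the answer
df$Proc_Review <- gsub("'", "", df$Review)
df$Proc_Review <- gsub("[[:punct:] ]+", " ", df$Proc_Review)
df$Proc_Review <- gsub("[[:digit:]]+", "", df$Proc_Review)

# Hypothetical row id so tokens can be regrouped by review
df$id <- seq_len(nrow(df))

no_stop <- df %>%
  unnest_tokens(word, Proc_Review, to_lower = FALSE) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(id) %>%
  summarise(No_Stop_Review = paste(word, collapse = " "), .groups = "drop")

# Reviews made up entirely of stop words would get NA here
df <- left_join(df, no_stop, by = "id")

print(df$No_Stop_Review)
```

With the three sample reviews this should give "Wont camera", "Does load" and "Its mind blowing", matching the token output above. Because to_lower = FALSE is kept, capitalised words such as "Does" and "Its" do not match the lowercase entries in stop_words; dropping that argument lowercases the tokens before the join, which would likely remove them as well.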

