Removing Stop Words From Text in R

I have a problem removing stop words from text data. The data set is web-scraped, contains customer reviews, and looks like this:

data$Review <- c("Won't let me use my camera", "Does not load","I'ts truly mind blowing!")

I did the data manipulation below and stored the result in a new variable in the data frame, so the reviews now look like this:

Manipulation part:
data$Proc_Review <- gsub("'", "", data$Review)                    # Remove apostrophes
data$Proc_Review <- gsub('[[:punct:] ]+', ' ', data$Proc_Review)  # Replace runs of punctuation and spaces with a single space
data$Proc_Review <- gsub('[[:digit:]]+', '', data$Proc_Review)    # Remove numbers
data$Proc_Review <- as.character(data$Proc_Review)

Result:
"wont let me use my camera", "does not load", "its truly mind blowing"

The next step is to remove stop words, for which I use the code below:

data("stop_words")

j <- 1
for (j in 1:nrow(data)) {
  description <- anti_join(
    data[j, ] %>% unnest_tokens(word, Proc_Review, drop = FALSE, to_lower = FALSE),
    stop_words
  )
  data[j, "Proc_Review"] <- paste(description, collapse = " ")
}

After that, the output is:

c(1, 1) c(17304, 17304) c(\"Won't let me use my camera\", \"Won't let me use my camera\") c(1, 1) c(1, 1) c(32, 32) c(4, 4) c(\"wont let me use my camera\", \"wont let me use my camera\") c(\"wont\", \"camera\")"

I have tried some other approaches, but the result was not what I wanted: stop words were removed from some reviews but not from all of them. For example, "it's" was removed in some reviews but remained in others.

What I want is for the reviews to appear, without the stop words, in a new column of the data set. Thank you so much in advance!

There is no need to use a for loop. Additionally, there was a bug in your data processing: in steps 2 and 3 you use the original vector, so all the processing done in the previous steps gets overwritten.
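As a side note on the `c(1, 1) c(17304, 17304) ...` output in the question: it most likely comes from calling `paste()` on the whole data frame returned by `anti_join()`, because `paste()` deparses each column into a string. The sketch below uses only base R and a hypothetical two-row data frame (`description`, with made-up columns `id` and `word`) as a stand-in for the tokenised rows of one review.

```r
# Hypothetical toy data frame standing in for the tokenised rows of one review.
description <- data.frame(
  id   = c(1, 1),
  word = c("wont", "camera"),
  stringsAsFactors = FALSE
)

# paste() treats the data frame as a list and deparses every column,
# which yields text of the form 'c(1, 1) c("wont", "camera")'.
whole_frame <- paste(description, collapse = " ")

# Pasting only the token column gives the intended string.
tokens_only <- paste(description$word, collapse = " ")

print(whole_frame)
print(tokens_only)
```

So inside the loop, pasting `description$word` rather than `description` would avoid the deparsed output, although the vectorised approach makes the loop unnecessary.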

library(tidytext)
library(dplyr)

data("stop_words")

df <- data.frame(
  Review = c("Won't let me use my camera", "Does not load","I'ts truly mind blowing!")
)

df$Proc_Review <- gsub("\\'", "", df$Review)                  # Remove apostrophes
df$Proc_Review <- gsub('[[:punct:] ]+', ' ', df$Proc_Review)  # Replace runs of punctuation and spaces with a single space
df$Proc_Review <- gsub('[[:digit:]]+', '', df$Proc_Review)    # Remove numbers
df$Proc_Review <- as.character(df$Proc_Review)

df %>%
  unnest_tokens(word, Proc_Review, drop = FALSE, to_lower = FALSE)  %>%
  anti_join(stop_words)
#> Joining, by = "word"
#>                       Review               Proc_Review    word
#> 1 Won't let me use my camera Wont let me use my camera    Wont
#> 2 Won't let me use my camera Wont let me use my camera  camera
#> 3              Does not load             Does not load    Does
#> 4              Does not load             Does not load    load
#> 5   I'ts truly mind blowing!   Its truly mind blowing      Its
#> 6   I'ts truly mind blowing!   Its truly mind blowing     mind
#> 7   I'ts truly mind blowing!   Its truly mind blowing  blowing

Created on 2022-06-04 by the reprex package (v2.0.1)
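The question asks for the cleaned reviews in a new column, one string per review, whereas the pipeline above returns one row per remaining token. A minimal sketch of one way to collapse the tokens back is shown here. It is not part of the original answer and makes several assumptions: `review_id` and `Review_clean` are hypothetical column names, the raw `Review` text is tokenised with the `unnest_tokens()` defaults (which lowercase the tokens, so they can match the lowercase entries in `stop_words`), and digits are not removed.

```r
library(dplyr)
library(tidytext)

data("stop_words")

df <- data.frame(
  Review = c("Won't let me use my camera", "Does not load", "I'ts truly mind blowing!"),
  stringsAsFactors = FALSE
)

# Hypothetical id column so tokens can be regrouped by review afterwards.
df$review_id <- seq_len(nrow(df))

cleaned <- df %>%
  unnest_tokens(word, Review) %>%        # defaults: lowercase tokens, punctuation stripped
  anti_join(stop_words, by = "word") %>% # drop tokens found in the stop word list
  group_by(review_id) %>%
  summarise(Review_clean = paste(word, collapse = " "), .groups = "drop")

# Attach the collapsed text to the original rows as a new column.
result <- left_join(df, cleaned, by = "review_id")
print(result)
```

A review made up entirely of stop words has no rows left after `anti_join()`, so it would get `NA` in `Review_clean` after the `left_join()`. Whether to keep apostrophes (so that tokens such as "won't" can match the stop word list) or strip them first is a design choice that depends on the data.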
