Removing stop words from a list of strings in R
Sample data
dput output of my data:
x <- structure(list(Comments = structure(2:1, .Label = c("I have a lot of home-work to be completed..",
"I want to vist my teacher today only!!"), class = "factor"),
Comment_ID = c(704, 802)), class = "data.frame", row.names = c(NA,
-2L))
I want to remove the stop words from the above data set using tidytext::stop_words$word and retain the same columns in the output. Along with this, how can I remove punctuation with the tidytext package?
Note: I don't want to convert my dataset into a corpus.
You can collapse all the words in tidytext::stop_words$word into one regex, wrapping each word in word boundaries (\\b). However, tidytext::stop_words$word has length 1149, which might be too big for the regex engine to handle, so you can drop the words you don't need before applying this.
For example, taking only the first 10 words from tidytext::stop_words$word, you can do:
gsub(paste0(paste0('\\b', tidytext::stop_words$word[1:10], '\\b',
collapse = "|"), '|[[:punct:]]+'), '', x$Comments)
#[1] "I want to vist my teacher today only"
# "I have lot of homework to be completed"
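If the full stop-word list turns out to be too long for a single regex, one way around it is to split each comment into words and filter them with %in%. The following is only a minimal base R sketch under stated assumptions: remove_stop_words is a hypothetical helper, and demo_stop_words is a short illustrative list standing in for tidytext::stop_words$word, which could be passed in its place.

```r
# Small illustrative stop-word list (an assumption for this sketch);
# pass tidytext::stop_words$word instead to use the full lexicon.
demo_stop_words <- c("a", "i", "to", "my", "of", "be", "have")

# Hypothetical helper: strip punctuation, then drop stop words from each string.
remove_stop_words <- function(text, stop_words) {
  text <- as.character(text)                    # factor -> character
  text <- gsub("[[:punct:]]+", "", text)        # remove punctuation
  tokens <- strsplit(text, "[[:space:]]+")      # split on whitespace
  vapply(tokens, function(tok) {
    # Compare in lower case so that "I" matches "i"; keep original casing.
    keep <- nzchar(tok) & !(tolower(tok) %in% stop_words)
    paste(tok[keep], collapse = " ")
  }, character(1))
}

x <- data.frame(
  Comments = c("I want to vist my teacher today only!!",
               "I have a lot of home-work to be completed.."),
  Comment_ID = c(704, 802),
  stringsAsFactors = FALSE
)

# Overwrite the column in place so that Comment_ID is retained.
x$Comments <- remove_stop_words(x$Comments, demo_stop_words)
print(x)
```

Because the words are rejoined with a single space, this version leaves no doubled spaces where a stop word was removed, and the data frame keeps both of its columns without being converted to a corpus.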
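Another answer uses removeWords() from the tm package, which also accepts a plain character vector (here clean_tweet):
clean_tweet = removeWords(clean_tweet, stopwords("english"))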