
Removing stop words from a list of strings in R

Sample data

dput() output of my data:

  x <-  structure(list(Comments = structure(2:1, .Label = c("I have a lot of home-work to be completed..", 
    "I want to vist my teacher today only!!"), class = "factor"), 
        Comment_ID = c(704, 802)), class = "data.frame", row.names = c(NA, 
    -2L))

I want to remove the stop words from the above data set using tidytext::stop_words$word while keeping the same columns in the output. In addition, how can I remove punctuation with the tidytext package?

Note: I don't want to convert my dataset into a corpus.

You can collapse all the words in tidytext::stop_words$word into a single regex, wrapping each word in word boundaries. However, tidytext::stop_words$word has length 1149, which might be too large for a regex to handle, so you can drop the words you don't need before applying this.

For example, taking only the first 10 words from tidytext::stop_words$word, you can do:

gsub(paste0(paste0('\\b', tidytext::stop_words$word[1:10], '\\b', 
     collapse = "|"), '|[[:punct:]]+'), '', x$Comments)


#[1] "I want to vist my teacher today only"    
#    "I have  lot of homework to be completed"
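
Since the question asks for a tidytext-based route that keeps the original columns, here is a minimal sketch of one common pattern: tokenize with unnest_tokens(), drop stop words with anti_join(), then paste the remaining words back together per Comment_ID. It assumes dplyr (1.0 or later) and tidytext are installed; the name `cleaned` is only illustrative, and the sample data is rebuilt with character columns instead of factors.

```r
library(dplyr)
library(tidytext)

# Sample data from the question, rebuilt as plain character strings
x <- data.frame(
  Comments = c("I want to vist my teacher today only!!",
               "I have a lot of home-work to be completed.."),
  Comment_ID = c(704, 802),
  stringsAsFactors = FALSE
)

cleaned <- x %>%
  # one row per word; by default this lowercases and strips punctuation
  unnest_tokens(word, Comments) %>%
  # drop every token that appears in tidytext's stop word table
  anti_join(stop_words, by = "word") %>%
  # rebuild one string per comment from the surviving words
  group_by(Comment_ID) %>%
  summarise(Comments = paste(word, collapse = " "), .groups = "drop") %>%
  # restore the original column order
  select(Comments, Comment_ID)

print(cleaned)
```

Two caveats with this sketch: the text comes back lowercased, and a comment made up entirely of stop words would disappear from the result, because no rows are left for it after anti_join().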

Alternatively, with the tm package (here clean_tweet is a character vector):

library(tm)
clean_tweet <- removeWords(clean_tweet, stopwords("english"))
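
If building one very long regex from the full stop word list is a concern, a token-based filter avoids the regex entirely. The sketch below uses only base R. The helper name `remove_stop_words` and the short `stop_vec` list are hypothetical and chosen so the example runs without extra packages; tidytext::stop_words$word could be passed in its place.

```r
# Hypothetical short stop word list so the sketch runs on its own;
# swap in tidytext::stop_words$word for the full list.
stop_vec <- c("a", "i", "to", "my", "of", "be", "have", "only")

# Sample data from the question, as character strings
x <- data.frame(
  Comments = c("I want to vist my teacher today only!!",
               "I have a lot of home-work to be completed.."),
  Comment_ID = c(704, 802),
  stringsAsFactors = FALSE
)

remove_stop_words <- function(text, stop_vec) {
  # strip punctuation first (this also joins "home-work" into "homework")
  text <- gsub("[[:punct:]]+", "", as.character(text))
  # split each string into whitespace-separated tokens
  tokens <- strsplit(text, "[[:space:]]+")
  # keep non-empty tokens whose lowercase form is not a stop word
  vapply(tokens, function(tok) {
    keep <- tok[nzchar(tok) & !(tolower(tok) %in% stop_vec)]
    paste(keep, collapse = " ")
  }, character(1))
}

# Overwrite the column in place, so the data frame keeps the same columns
x$Comments <- remove_stop_words(x$Comments, stop_vec)
print(x)
```

Because punctuation is removed before matching, stop words that contain an apostrophe (for example "don't") would no longer match their list entry; matching before stripping punctuation is one way around that.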
