Remove stop words from a list of strings in R

Question

Sample data

dput output of my data:

  x <-  structure(list(Comments = structure(2:1, .Label = c("I have a lot of home-work to be completed..", 
    "I want to vist my teacher today only!!"), class = "factor"), 
        Comment_ID = c(704, 802)), class = "data.frame", row.names = c(NA, 
    -2L))

I want to remove the stop words from the above data set using tidytext::stop_words$word while keeping the same columns in the output. In addition, how can I remove punctuation with the tidytext package?

Note: I don't want to convert my dataset into a corpus.

Answer 1

You can collapse all the words in tidytext::stop_words$word into a single regex, wrapping each word in word boundaries. However, tidytext::stop_words$word has 1149 entries, which might be too large for one regex to handle, so you can drop the words you don't need before applying this.

For example, taking only the first 10 words from tidytext::stop_words$word, you can do:

gsub(paste0(paste0('\\b', tidytext::stop_words$word[1:10], '\\b', 
     collapse = "|"), '|[[:punct:]]+'), '', x$Comments)


#[1] "I want to vist my teacher today only"    
#    "I have  lot of homework to be completed"
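
To keep both columns, the cleaned strings can be written back into the data frame. The following is a minimal sketch in base R under a few assumptions: `stops` is a short hand-picked vector standing in for tidytext::stop_words$word, `strip_stops` is a hypothetical helper name, and the whitespace cleanup at the end is an extra step that is not part of the answer above.

```r
# Rebuild the sample data from the question (Comments as plain character)
x <- data.frame(
  Comments = c("I want to vist my teacher today only!!",
               "I have a lot of home-work to be completed.."),
  Comment_ID = c(704, 802),
  stringsAsFactors = FALSE
)

# Stand-in stop words (assumption); replace with tidytext::stop_words$word
# or a subset of it if the package is installed
stops <- c("a", "of", "to", "be", "my")

# Hypothetical helper: remove whole-word stop words and punctuation,
# then collapse the leftover runs of spaces
strip_stops <- function(text, stops) {
  pattern <- paste0("\\b(", paste(stops, collapse = "|"), ")\\b|[[:punct:]]+")
  cleaned <- gsub(pattern, "", as.character(text))
  trimws(gsub("\\s+", " ", cleaned))
}

# Overwrite the column in place so Comment_ID is retained
x$Comments <- strip_stops(x$Comments, stops)
print(x)
```

The entries in tidytext::stop_words$word are lowercase, so capitalized words such as "I" are only matched if `ignore.case = TRUE` is passed to `gsub()`.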

Answer 2

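library(tm)  # removeWords() and stopwords() are provided by the tm package
clean_tweet <- removeWords(clean_tweet, stopwords("english"))  # clean_tweet: a character vector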
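
Applied to the data frame from the question, the same idea might look like the sketch below. It assumes the tm package is installed and uses tm's `removePunctuation()` and `stripWhitespace()` for the cleanup; no corpus is created because these functions also accept plain character vectors.

```r
library(tm)

# Sample data from the question, with Comments as plain character
x <- data.frame(
  Comments = c("I want to vist my teacher today only!!",
               "I have a lot of home-work to be completed.."),
  Comment_ID = c(704, 802),
  stringsAsFactors = FALSE
)

# Remove stop words first (some contain apostrophes), then punctuation,
# then collapse the repeated spaces left behind
cleaned <- removeWords(x$Comments, stopwords("english"))
cleaned <- removePunctuation(cleaned)
x$Comments <- stripWhitespace(cleaned)

print(x)
```

`removeWords()` is case-sensitive and the tm stop word list is lowercase, so lowercasing the text first with `tolower()` may be needed to catch capitalized words.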

Answer 1
1 accepted 2020-06-24 13:24:19

Answer 2
0 2021-09-16 06:31:49
