
Removing stop words with dplyr

The text at http://tidytextmining.com/tidytext.html states:

"

Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().

data(stop_words)

tidy_books <- tidy_books %>% anti_join(stop_words)

"

I'm attempting to modify this to remove stop words from a string:

data(stop_words)
str_v <- paste(c("this is a test"))
str_v <- str_v %>%
  anti_join(stop_words)

but it returns an error:

Error in UseMethod("anti_join") : 
  no applicable method for 'anti_join' applied to an object of class "character"

Do I need to convert str_v to a class that has an anti_join method?

str_v is a character vector, and anti_join only has methods for data frames. Convert it to a data.frame or tibble first with as_tibble (spelled as.tibble in older versions of tibble), which puts the string in a column named 'value'. Then unnest_tokens splits the 'value' column into one word per row and names the new column 'word', so that anti_join has a common column with stop_words and joins by 'word'.

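library(tidytext)
library(tibble)
library(dplyr)

str_v %>%
    as_tibble() %>%                    # character vector -> tibble with a 'value' column
    unnest_tokens(word, value) %>%     # one lowercase word per row, in a column named 'word'
    anti_join(stop_words, by = "word") # keep only rows whose word is not a stop word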
# A tibble: 1 x 1
#   word
#  <chr>
#1  test
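
If the goal is to get a cleaned string back rather than a tibble of words, the same steps can be wrapped in a small function. This is only a sketch: remove_stop_words, its stop_tbl argument, and the id/text column names are hypothetical names made up for illustration and are not part of tidytext or dplyr. It assumes dplyr 1.0 or later for the .groups argument.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical helper (not part of any package): removes stop words from
# each element of a character vector and returns a character vector.
remove_stop_words <- function(x, stop_tbl = tidytext::stop_words) {
  ids <- tibble(id = seq_along(x))
  tibble(id = seq_along(x), text = x) %>%
    unnest_tokens(word, text) %>%            # one lowercase word per row
    anti_join(stop_tbl, by = "word") %>%     # drop rows that match a stop word
    group_by(id) %>%
    summarise(text = paste(word, collapse = " "), .groups = "drop") %>%
    right_join(ids, by = "id") %>%           # restore inputs that lost every word
    arrange(id) %>%
    mutate(text = coalesce(text, "")) %>%    # those inputs become empty strings
    pull(text)
}

cleaned <- remove_stop_words(c("this is a test", "this is a"))
print(cleaned)
```

Note that unnest_tokens lowercases the text and strips punctuation by default, so the returned strings are normalized versions of the input, not the original text with words cut out.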
