Remove words in a character vector from a string

Question

I have a character vector of stopwords in R:

stopwords = c("a" ,
            "able" ,
            "about" ,
            "above" ,
            "abst" ,
            "accordance" ,
            ...
            "yourself" ,
            "yourselves" ,
            "you've" ,
            "z" ,
            "zero")

Let's say I have the string:

str <- c("I have zero a accordance")

How can I remove my defined stopwords from str?

I think gsub or another grep tool could be a good candidate to pull this off, although other recommendations are welcome.

Answer 1

Try this:

str <- c("I have zero a accordance")

stopwords = c("a", "able", "about", "above", "abst", "accordance", "yourself",
"yourselves", "you've", "z", "zero")

x <- unlist(strsplit(str, " "))

x <- x[!x %in% stopwords]

paste(x, collapse = " ")

# [1] "I have"

Addition: Writing a "removeWords" function is simple, so it is not necessary to load an external package for this purpose:

removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}

removeWords(str, stopwords)
# [1] "I have"

Answer 2

You could use the tm library for this:

require("tm")
removeWords(str,stopwords)
#[1] "I have   "
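
The output above keeps the spaces that surrounded the removed words. A minimal base R sketch for tidying that up, assuming a single space between words is what you want; squish_spaces is a hypothetical helper name, not part of tm:

```r
# Hypothetical helper: collapse runs of whitespace and trim both ends.
squish_spaces <- function(x) {
  trimws(gsub("\\s+", " ", x))
}

# Same shape as the tm output shown above: "I have" plus three trailing spaces.
squish_spaces("I have   ")
# [1] "I have"
```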

Answer 3

If stopwords is long, the removeWords() solution should be much faster than any regex-based solution.

For completeness, in case str is a vector of strings, one can write:
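
For comparison, a regex-based version along the lines of the gsub idea from the question might look like the sketch below. It assumes the stopwords contain no regex metacharacters, and it uses a shortened stopword list for illustration:

```r
str <- "I have zero a accordance"
stopwords <- c("a", "able", "about", "above", "abst", "accordance",
               "yourself", "yourselves", "you've", "z", "zero")

# Build a single alternation wrapped in word boundaries: \b(a|able|...)\b
pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")

# Matching is case-sensitive, so "I" is left alone.
removed <- gsub(pattern, "", str, perl = TRUE)
removed
# [1] "I have   "
```

The spaces around the removed words stay in place, which is why the result ends in three spaces. The pattern grows with the length of stopwords, which is the cost the speed remark above refers to.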

library("magrittr")
library("stringr")
library("purrr")

remove_words <- function(x, .stopwords) {
  x %>%
    stringr::str_split(" ") %>%
    purrr::flatten_chr() %>%
    # Filter with %in%; setdiff() would also drop repeated non-stopwords
    purrr::discard(function(w) w %in% .stopwords) %>%
    stringr::str_c(collapse = " ")
}
purrr::map_chr(str, remove_words, .stopwords = stopwords)
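
If you would rather avoid the three package dependencies, the same per-element behaviour can be sketched in base R. remove_words_base is a hypothetical name, and the stopword list is shortened for illustration:

```r
stopwords <- c("a", "able", "about", "above", "abst", "accordance",
               "yourself", "yourselves", "you've", "z", "zero")

# Hypothetical base R helper: process each string separately and
# return one cleaned string per input element.
remove_words_base <- function(x, stopwords) {
  tokens <- strsplit(x, " ", fixed = TRUE)
  vapply(tokens,
         function(words) paste(words[!words %in% stopwords], collapse = " "),
         character(1))
}

remove_words_base(c("I have zero a accordance", "zero is a a number"), stopwords)
# [1] "I have"    "is number"
```

Because it filters with %in% rather than a set operation, words that appear more than once in the input are kept as many times as they occur.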

3 answers

Answer 1
15 2016-03-04 07:54:08

Answer 2
15 accepted 2016-03-04 08:06:29

Answer 3
0 2019-05-02 17:16:32
