Remove all words except the words in a vector

Question

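Removing stop words from a text or character vector is very common. I use the removeWords function from the tm package for that.

However, I am trying to remove all words except the stop words. I have a list of words called x. When I use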

removeWords(text, x)

I get this error:

In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), PCRE pattern compilation error 'regular expression is too large'

I have also tried using grep:

grep(x, text)

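But this does not work, because x is a vector rather than a single string.

So how can I remove all the words that are not in that vector? Or, alternatively, how can I select only the words that are in the vector?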

Answer 1

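If you want to use x as a regex pattern for grep, just use x <- paste(x, collapse = "|"), which lets you look for those words in text. Keep in mind, though, that the regular expression may still be too large. If you want to remove every word that is not in stopwords(), you can write your own function: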

library(tm)  # provides stopwords()

keep_stopwords <- function(text) {
  # One alternation with every stop word wrapped in word boundaries
  stop_regex <- paste0("\\b", paste(stopwords(), collapse = "\\b|\\b"), "\\b")
  # Split on spaces and keep only the tokens that match a stop word
  tmp <- strsplit(text, " ")[[1]]
  idx <- grepl(stop_regex, tmp)
  paste(tmp[idx], collapse = " ")
}

text = "How much wood would a woodchuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_stopwords(text)
# [1] "would a if a could than most would if could but than other"

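Basically, we turn stopwords() into a single regular expression that matches any of those words. We have to watch out for partial matches, so we wrap each stop word in \\b to make sure only whole words match. Then we split the string so that each word is matched separately, and build an index of the words that are stop words. Finally, we paste those words back together and return them as a single string.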
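To see why the word boundaries matter, here is a small base R sketch. The keep vector and the tokens are made up for illustration (in the answer above the list comes from stopwords()), and the pattern is built slightly differently, with one pair of \\b around a group, which should behave the same way.

```r
# Hypothetical keep-list; stands in for stopwords() from tm.
keep <- c("a", "if", "would")
tokens <- c("a", "what", "would", "wood", "if", "life")

# Plain alternation: matches any token that merely contains a keep word.
loose <- paste(keep, collapse = "|")
loose_hits <- grepl(loose, tokens)

# Alternation wrapped in word boundaries: only whole-word matches count.
strict <- paste0("\\b(", paste(keep, collapse = "|"), ")\\b")
strict_hits <- grepl(strict, tokens)

tokens[loose_hits]   # "a" "what" "would" "if" "life"
tokens[strict_hits]  # "a" "would" "if"
```

Without the boundaries, "what" and "life" slip through because they contain "a" and "if".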

Edit

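Here is another approach that is simpler and easier to understand. It also does not rely on regular expressions, which can be expensive on large documents.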

keep_words <- function(text, keep) {
  words <- strsplit(text, " ")[[1]]
  txt <- paste(words[words %in% keep], collapse = " ")
  return(txt)
}
x <- "How much wood would a woodchuck chuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_words(x, stopwords())
# [1] "would a if a could than most would if could but than other"
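Note that %in% compares tokens exactly, so "How" (capitalised) or a token with trailing punctuation such as "wood?" never matches a lower-case entry in the keep-list. If that matters, one possible variation is to normalise each token for the comparison only. This is just a sketch: keep_words_normalized and the example keep vector are hypothetical names, not part of tm.

```r
# Hypothetical helper: like keep_words(), but ignores case and
# leading/trailing punctuation when comparing against the keep-list.
keep_words_normalized <- function(text, keep) {
  words <- strsplit(text, "\\s+")[[1]]
  # Normalised copy used only for the lookup; original tokens are returned
  key <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", words))
  paste(words[key %in% tolower(keep)], collapse = " ")
}

keep <- c("how", "more", "would", "a", "if")  # assumed example list
result <- keep_words_normalized(
  "How much wood would a woodchuck chuck, if... More?", keep
)
result
# [1] "How would a if... More?"
```

The returned tokens keep their original case and punctuation; only the lookup is normalised.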
