
tm_map to remove words containing my stop words?

I am applying removeWords to filter a corpus like this:

corpus <- Corpus(vs, readerControl = list(language="en")) 
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace) 
corpus <- tm_map(corpus, removeWords, c(stopwords("english"))) 
corpus <- tm_map(corpus, removeWords, bannedWords$V1) 

However, this only works for exact matches, so:

  • f*ck is removed
  • f*cking is not removed

How can I remove words that contain my stop words?

You can use stemming to reduce the banned words to their base form. See the example below.

library(tm)

banned <- c("buck")
text <- c("He is bucking the trend", "A buck is not worth a dollar anymore!")

corpus <- Corpus(VectorSource(text), readerControl = list(language="en")) 
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace) 
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), banned)) 

writeLines(as.character(corpus[[1]]))
  trend
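stemDocument delegates to the SnowballC stemmer, which reduces the inflected forms to the same base, so the banned word then matches exactly. A quick way to see this (assuming SnowballC, which tm uses for stemming, is installed):

SnowballC::wordStem(c("buck", "bucks", "bucking"), language = "english")
  [1] "buck" "buck" "buck"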

If you do not stem the documents, you get:

corpus <- Corpus(VectorSource(text), readerControl = list(language="en")) 
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace) 
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), banned)) 

writeLines(as.character(corpus[[1]]))
  bucking  trend

I found the answer by looking at the removeWords function in the tm source code and extending its regular expression from:

gsub(sprintf("(*UCP)\\b(%s)\\b",

to:

gsub(sprintf("(*UCP)\\b[a-zA-Z]*(%s)[a-zA-Z]*\\b",
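To see the difference, here is a minimal sketch using the "buck" example from the other answer (the leftover whitespace is shown as-is; stripWhitespace can clean it up afterwards):

gsub("(*UCP)\\b(buck)\\b", "", "buck bucking", perl = TRUE)
  [1] " bucking"
gsub("(*UCP)\\b[a-zA-Z]*(buck)[a-zA-Z]*\\b", "", "buck bucking", perl = TRUE)
  [1] " "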

The full function:

# Like removeWords, but the added [a-zA-Z]* on either side of the pattern
# also matches words that merely contain one of the banned words.
removeWordsContaining <-
function(x, words)
    UseMethod("removeWordsContaining", x)
removeWordsContaining.character <-
function(x, words)
    gsub(sprintf("(*UCP)\\b[a-zA-Z]*(%s)[a-zA-Z]*\\b",
                 paste(sort(words, decreasing = TRUE), collapse = "|")),
         "", x, perl = TRUE)
removeWordsContaining.PlainTextDocument <-
    content_transformer(removeWordsContaining.character)

blog_corpus <- Corpus(vs, readerControl = list(language="en")) 
blog_corpus <- tm_map(blog_corpus, content_transformer(tolower))
blog_corpus <- tm_map(blog_corpus, stripWhitespace) 
blog_corpus <- tm_map(blog_corpus, removePunctuation) 
blog_corpus <- tm_map(blog_corpus, removeNumbers) 
blog_corpus <- tm_map(blog_corpus, removeWords, c(stopwords("english"))) 
blog_corpus <- tm_map(blog_corpus, removeWordsContaining, bannedWords$V1) 
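Since vs and bannedWords come from my own data, here is a self-contained check (the corpus name check and the sample sentence are made up for illustration), reusing the "buck" example from the other answer. VCorpus is used so that tm_map applies the custom transformation per document; the derived form is now removed without stemming:

library(tm)

check <- VCorpus(VectorSource("He is bucking the trend"))
check <- tm_map(check, content_transformer(tolower))
check <- tm_map(check, removeWordsContaining, c("buck"))

writeLines(as.character(check[[1]]))
  he is  the trend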
