
How to remove duplicates from a corpus using the tm package in R

I am trying to remove duplicates from a corpus using the tm package in R. For example, to remove ampersands, I use the following R statements:

removeAmp <- function(x) gsub("&amp;", "", x)

myCorpus <- tm_map(myCorpus, removeAmp)

I then try to remove duplicates using the following:

removeDup <- function(x) unique(x)

myCorpus <- tm_map(myCorpus, removeDup)

I get the error message:

Error in match.fun(FUN) : argument "FUN" is missing, with no default

I have also tried

removeDup <- function(x) as.list(unique(unlist(x)))

but still get an error. Any help would be very much appreciated.
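One low-tech workaround is to deduplicate the raw text before it ever goes through `tm_map`, since `unique()` works directly on a character vector. A minimal base-R sketch (the `texts` vector is a made-up example):

```r
# Hypothetical example texts; the third entry is an exact duplicate of the first.
texts <- c("good morning & hello", "good evening", "good morning & hello")

# Deduplicate before building the corpus at all.
unique_texts <- unique(texts)

length(unique_texts)  # 2 distinct documents remain
```

The corpus can then be built from `unique_texts` with no custom transformation needed.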

Removing duplicated entries can be done with the following code.

First, convert the previously cleaned corpus back to a data frame:

df.tweets <- data.frame(text = unlist(sapply(tweet.corpus, `[`, "content")), stringsAsFactors = FALSE)

Second, remove duplicate entries in the data frame:

tweets.out.unique <- unique(df.tweets)

Third, convert it back to a corpus (if needed), assuming the data frame has a single column:

tweet.corpus.clean <- Corpus(DataframeSource(tweets.out.unique[1]))
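Put together, the three steps look roughly like the sketch below. The three-document toy corpus is made-up example data, and `VectorSource` is used in the last step instead of `DataframeSource`, since recent versions of tm expect a `doc_id`/`text` column pair for `DataframeSource`:

```r
library(tm)

# Toy corpus with one exact-duplicate document (made-up example data).
tweet.corpus <- VCorpus(VectorSource(c("hello world", "second tweet", "hello world")))

# 1. Flatten the corpus back to a data frame of document texts.
df.tweets <- data.frame(text = sapply(tweet.corpus, content),
                        stringsAsFactors = FALSE)

# 2. Drop exact-duplicate rows.
tweets.out.unique <- unique(df.tweets)

# 3. Rebuild a corpus from the surviving texts.
tweet.corpus.clean <- VCorpus(VectorSource(tweets.out.unique$text))

length(tweet.corpus.clean)  # 2
```

Note that `unique()` on a data frame only catches exact string matches; near-duplicates (e.g. retweets with extra whitespace) survive unless the corpus is normalized first.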

I don't know if this is more elegant, but it is quite easy!

This worked for me:

clean.corpus <- function(corpus) {
    # Remove "mc.cores = 1" on Windows (it is only needed on Macintosh).
    # Assumes 'use.stopwords' is defined beforehand, e.g. use.stopwords <- "en".
    removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
    myStopwords <- c(stopwords(use.stopwords), "twitter", "tweets", "tweet", "tweeting", "retweet", "followme", "account", "available", "via")
    myStopwords <- c(myStopwords, "melinafollowme", "voten", "samier", "zsm", "hpa", "geraus", "vote", "gevotet", "dagibee", "berlin")
    myStopwords <- c(myStopwords, "mal", "dass", "für", "votesami", "votedagi", "vorhersage", "\u2728\u2728\u2728\u2728\u2728", "\u2728\u2728\u2728")

    cleaned.corpus <- tm_map(corpus, stripWhitespace, lazy = TRUE)
    cleaned.corpus <- tm_map(cleaned.corpus, content_transformer(tolower), mc.cores = 1)
    cleaned.corpus <- tm_map(cleaned.corpus, content_transformer(function(x) iconv(x, to = "UTF-8-MAC", sub = "byte")), lazy = TRUE)
    cleaned.corpus <- tm_map(cleaned.corpus, removePunctuation, lazy = TRUE)
    cleaned.corpus <- tm_map(cleaned.corpus, removeNumbers, lazy = TRUE)
    cleaned.corpus <- tm_map(cleaned.corpus, removeURL)
    cleaned.corpus <- tm_map(cleaned.corpus, function(x) removeWords(x, myStopwords), mc.cores = 1)
    cleaned.corpus <- tm_map(cleaned.corpus, function(x) removeWords(x, stopwords(use.stopwords)), mc.cores = 1)

    removeDup <- function(x) unique(x)
    cleaned.corpus <- tm_map(cleaned.corpus, removeDup, mc.cores = 1)

    cleaned.corpus <- tm_map(cleaned.corpus, PlainTextDocument)
    return(cleaned.corpus)
}
