
How to remove duplicates from a corpus using the tm package in R

I am trying to remove duplicates from a corpus using the tm package in R. For example, to remove ampersands, I use the following R statements:

removeAmp <- function(x) gsub("&amp;", "", x)

myCorpus <- tm_map(myCorpus, removeAmp)

I then try to remove duplicates using the following:

removeDup <- function(x) unique(x)

myCorpus <- tm_map(myCorpus, removeDup)

I get the error message:

Error in match.fun(FUN) : argument "FUN" is missing, with no default

I have also tried

removeDup <- function(x) as.list(unique(unlist(x)))

but still get an error. Any help would be very much appreciated.
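One low-tech workaround is to deduplicate the raw text before it ever goes through `tm_map`, since `unique()` works directly on a character vector. A minimal base-R sketch (the `texts` vector is a made-up example):

```r
# Hypothetical example texts; the third entry is an exact duplicate of the first.
texts <- c("good morning & hello", "good evening", "good morning & hello")

# Deduplicate before building the corpus at all.
unique_texts <- unique(texts)

length(unique_texts)  # 2 distinct documents remain
```

The corpus can then be built from `unique_texts` with no custom transformation needed.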

Removing duplicated entries can be done with the following code.

First, convert the previously cleaned corpus back to a data frame:

df.tweets <- data.frame(text = unlist(sapply(tweet.corpus, `[`, "content")), stringsAsFactors = FALSE)

Second, remove duplicate entries in the data frame:

tweets.out.unique <- unique(df.tweets)

Third, convert it back to a corpus (if needed), assuming the data frame has a single column:

tweet.corpus.clean <- Corpus(DataframeSource(tweets.out.unique[1]))
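Put together, the three steps look roughly like the sketch below. The three-document toy corpus is made-up example data, and `VectorSource` is used in the last step instead of `DataframeSource`, since recent versions of tm expect a `doc_id`/`text` column pair for `DataframeSource`:

```r
library(tm)

# Toy corpus with one exact-duplicate document (made-up example data).
tweet.corpus <- VCorpus(VectorSource(c("hello world", "second tweet", "hello world")))

# 1. Flatten the corpus back to a data frame of document texts.
df.tweets <- data.frame(text = sapply(tweet.corpus, content),
                        stringsAsFactors = FALSE)

# 2. Drop exact-duplicate rows.
tweets.out.unique <- unique(df.tweets)

# 3. Rebuild a corpus from the surviving texts.
tweet.corpus.clean <- VCorpus(VectorSource(tweets.out.unique$text))

length(tweet.corpus.clean)  # 2
```

Note that `unique()` on a data frame only catches exact string matches; near-duplicates (e.g. retweets with extra whitespace) survive unless the corpus is normalized first.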

I don't know if this is more elegant, but it is quite easy!

This worked for me:

clean.corpus <- function(corpus) {
    # Remove "mc.cores = 1" on Windows (it is only needed on Macintosh).
    # Assumes 'use.stopwords' is defined beforehand, e.g. use.stopwords <- "en".
    removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
    myStopwords <- c(stopwords(use.stopwords), "twitter", "tweets", "tweet", "tweeting", "retweet", "followme", "account", "available", "via")
    myStopwords <- c(myStopwords, "melinafollowme", "voten", "samier", "zsm", "hpa", "geraus", "vote", "gevotet", "dagibee", "berlin")
    myStopwords <- c(myStopwords, "mal", "dass", "für", "votesami", "votedagi", "vorhersage", "\u2728\u2728\u2728\u2728\u2728", "\u2728\u2728\u2728")

    cleaned.corpus <- tm_map(corpus, stripWhitespace, lazy = TRUE)
    cleaned.corpus <- tm_map(cleaned.corpus, content_transformer(tolower), mc.cores = 1)
    cleaned.corpus <- tm_map(cleaned.corpus, content_transformer(function(x) iconv(x, to = "UTF-8-MAC", sub = "byte")), lazy = TRUE)
    cleaned.corpus <- tm_map(cleaned.corpus, removePunctuation, lazy = TRUE)
    cleaned.corpus <- tm_map(cleaned.corpus, removeNumbers, lazy = TRUE)
    cleaned.corpus <- tm_map(cleaned.corpus, removeURL)
    cleaned.corpus <- tm_map(cleaned.corpus, function(x) removeWords(x, myStopwords), mc.cores = 1)
    cleaned.corpus <- tm_map(cleaned.corpus, function(x) removeWords(x, stopwords(use.stopwords)), mc.cores = 1)

    removeDup <- function(x) unique(x)
    cleaned.corpus <- tm_map(cleaned.corpus, removeDup, mc.cores = 1)

    cleaned.corpus <- tm_map(cleaned.corpus, PlainTextDocument)
    return(cleaned.corpus)
}
