
How to remove empty documents from a document-term matrix in R

I am running k-means clustering on Twitter data: I clean the tweets, build a corpus, create a document-term matrix (DTM), and then apply tf-idf weighting.

However, my DTM contains a few empty documents, which I want to remove because k-means cannot run on empty documents.

Here is my code:

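library(tm)        # Corpus, tm_map, DocumentTermMatrix, weightTfIdf, stopwords
library(magrittr)  # provides the %>% pipe

removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)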
replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]"," ",x)
  x <- gsub("[:]+[ ]"," ",x)
  x <- gsub("[?]"," ",x)
  x <- gsub("[!]"," ",x)
  x <- gsub("[;]"," ",x)
  x <- gsub("[,]"," ",x)
  x <- gsub("[@]"," ",x)
  x <- gsub("[???]"," ",x)
  x <- gsub("[€]"," ",x)
  x

}

myStopwords <- c(stopwords('english'), "rt")


#preprocessing
tweet_corpus <- Corpus(VectorSource(tweet_raw$text))
tweet_corpus_clean <- tweet_corpus %>%
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeNumbers) %>%
  tm_map(removeWords,myStopwords) %>%
  tm_map(content_transformer(replacePunctuation)) %>%
  tm_map(stripWhitespace)%>%
  tm_map(content_transformer(removeURL))


dtm <- DocumentTermMatrix(tweet_corpus_clean)

# tf-idf weighting
mat4 <- weightTfIdf(dtm)  # when I run this, I get 2 documents that are empty
mat4 <- as.matrix(mat4)

You can't do that with another tm_map, because tm_map transforms each document and never drops any.

But the tm package also has tm_filter, which you can use to filter out empty documents.
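A minimal sketch of that idea, assuming a toy corpus in place of the cleaned tweets. The `docs` vector and the `has_text` helper are hypothetical names introduced for illustration. Note that a document can still contain text and yet produce an empty DTM row, because DocumentTermMatrix drops terms shorter than three characters by default, so this filter may not catch every case.

```r
library(tm)

# Toy stand-in for the cleaned tweets (hypothetical data); the second
# document is whitespace only, as a tweet can be after cleaning.
docs <- c("clustering tweets with kmeans", "   ", "tfidf weighting for tweets")
corpus <- VCorpus(VectorSource(docs))

# Hypothetical helper: TRUE when a document still has non-whitespace text.
# as.character() returns the text of a tm document (or of a plain string).
has_text <- function(doc) any(nzchar(trimws(as.character(doc))))

# tm_filter keeps only the documents for which the predicate returns TRUE.
corpus_nonempty <- tm_filter(corpus, has_text)
length(corpus_nonempty)
```

The filtered corpus can then be passed to DocumentTermMatrix as before.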

If a document does not contain any terms, you can do this:

rowSumDoc <- apply(dtm, 1, sum)  # total term count per document (row)
dtm2 <- dtm[rowSumDoc > 0, ]     # keep only documents with at least one term

The first line sums the term counts in each document (each row of the DTM). The second line subsets the DTM to the rows whose total is greater than zero, that is, the non-empty documents.
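The same row-sum filter can be sketched with row_sums from the slam package (a dependency of tm), which works on the sparse matrix directly instead of converting it to a dense one. This is a sketch on hypothetical toy data, not the asker's tweets; the names `docs`, `keep`, `dtm_nonempty`, and `docs_kept` are assumptions for illustration. The filter is applied before weightTfIdf, and the same index is reused on the source text so that cluster labels can later be matched back to the original rows.

```r
library(tm)
library(slam)

# Toy stand-in for the cleaned tweets (hypothetical data); the second
# document is empty and produces an all-zero row in the DTM.
docs <- c("clustering tweets with kmeans", "", "tfidf weighting for tweets")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Sparse row totals: one term count per document.
keep <- row_sums(dtm) > 0

# Drop the empty documents before weighting.
dtm_nonempty <- dtm[keep, ]

# tf-idf on the filtered matrix, then a dense matrix for kmeans().
tfidf <- weightTfIdf(dtm_nonempty)
mat <- as.matrix(tfidf)

# Keep the source text aligned with the remaining rows.
docs_kept <- docs[keep]
```

Reusing `keep` on the original data frame (for example `tweet_raw[keep, ]`) keeps its rows aligned with the matrix passed to kmeans.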
