How to remove the empty documents from the Document Term Matrix in R

I have empty documents in my document term matrix and I need to remove them. This is the code that I used to build the DocumentTermMatrix:

 tweets_dtm_tfidf <- DocumentTermMatrix(tweet_corpus, control = list(weighting = weightTfIdf))

And this is the warning message that I am getting:

Warning message:
In weighting(x) :
  empty document(s): 823 3795 4265 7252 7295 7425 8240 8433 9303 12160 12278 14465 15166 15485 15933 20775 21666 21807 26131 27039 34035 34050 34101

I tried removing these empty documents using this code:

rowTotals <- apply(tweets_dtm_tfidf , 1, sum)
dtm_tfidf   <- tweets_dtm_tfidf[rowTotals> 0, ]

Here is the error that I get when trying to remove them:

> rowTotals <- apply(tweets_dtm_tfidf , 1, sum)

Error: cannot allocate vector of size 6.8 Gb

Any idea on how to go about this? Thanks in advance for any suggestions.

Calling apply with sum converts your sparse matrix into a dense matrix first (apply runs as.matrix on its input), and this eats up a lot of memory if it is a big sparse matrix.
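To get a feel for why the dense conversion fails, here is a rough back-of-the-envelope sketch. It assumes every cell is stored as an 8-byte double. The helper name dense_size_gb and the dimensions passed to it are hypothetical, chosen only for illustration, since the question does not give the real dimensions of the matrix.

```r
# Estimated size, in GB (1024^3 bytes), of a dense numeric matrix.
# Assumption: each cell is an 8-byte double, as in a plain R numeric matrix.
dense_size_gb <- function(n_docs, n_terms) {
  n_docs * n_terms * 8 / 1024^3
}

# Hypothetical dimensions, not taken from the question.
dense_size_gb(35000, 26000)
```

On a real matrix you could pass nDocs(tweets_dtm_tfidf) and nTerms(tweets_dtm_tfidf) to see how large the dense copy would be.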

The apply function is not needed here. There are dedicated functions for sparse matrices. Since the dtm is a simple_triplet_matrix, you can use row_sums from the slam package.

The following should work.

rowTotals <- slam::row_sums(tweets_dtm_tfidf)
dtm_tfidf <- tweets_dtm_tfidf[rowTotals > 0, ]
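For anyone who wants to try this on something small first, here is a minimal sketch on a toy corpus. The documents and all variable names (docs, toy_corpus, toy_dtm, and so on) are made up for illustration, and the matrix is built without tf-idf weighting to keep the example free of warnings. The idea is the same: sum the rows on the sparse representation, then keep the rows with a positive total.

```r
library(tm)
library(slam)

# Toy data (hypothetical); the second document is deliberately empty.
docs <- c("sparse matrices save memory", "", "dense matrices use memory")
toy_corpus <- VCorpus(VectorSource(docs))

# Plain term-frequency DTM; the question uses weightTfIdf instead.
toy_dtm <- DocumentTermMatrix(toy_corpus)

# row_sums works on the simple_triplet_matrix without building a dense copy.
row_totals <- slam::row_sums(toy_dtm)

# Keep only the documents that contain at least one term.
toy_dtm_nonempty <- toy_dtm[row_totals > 0, ]

nDocs(toy_dtm)
nDocs(toy_dtm_nonempty)
```

If you also need the corpus to stay aligned with the filtered matrix, the same logical vector (row_totals > 0) can be used to subset the corpus.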

But remember that anything you do to get your data out of the sparse matrix might result in a memory-hogging object if you have a lot of words. You might want to use removeSparseTerms before moving on.
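A small sketch of removeSparseTerms, again on made-up data. The corpus and the threshold sparse = 0.6 are assumptions for illustration only; the right threshold for real tweets depends on the data. The sparse argument is the maximum allowed sparsity per term, so lower values drop more terms.

```r
library(tm)

# Toy data (hypothetical): "memory" is in every document, the other terms in one each.
docs <- c("memory sparse", "memory dense", "memory matrix", "memory triplet")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Drop terms that are missing from 60% or more of the documents.
small_dtm <- removeSparseTerms(toy_dtm, sparse = 0.6)

nTerms(toy_dtm)
Terms(small_dtm)
```

Note that dropping terms can leave some documents without any remaining terms, so it may be worth checking the row sums again afterwards.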
