How to remove NA columns from a TDM for clustering
I'm struggling with NA values in a TDM (term-document matrix) when trying to run clustering. Initially I set:
titles.tdm <- as.matrix(TermDocumentMatrix(titles.cw, control = list(bounds = list(global = c(10,Inf)))))
titles.sc <- scale(na.omit(titles.tdm))
and got a matrix of 418 terms and 6955 documents. At this point, executing:
titles.km <- kmeans(titles.sc, 2)
throws Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
When I tried to remove those values with:
titles.sf <- titles.sc[,colSums(titles.sc) > 0]
I got a matrix of 4695 documents, but applying the kmeans function still throws this error. When I inspect the titles.sf variable, there are still columns (docs) with NA values. I'm confused and don't know what I'm doing wrong. How can I remove those documents?
Earlier, I applied
titles.cw <- titles.cc[which(str_trim(titles.cc$content) != "")]
where titles.cc is a plain Corpus object from the tm library, to delete blank documents. It probably worked, but my NA values are in documents that are definitely not blank.
Here's some example data:
set.seed(123)
titles.sc <- matrix(1:25,5,5)
titles.sc[sample(length(titles.sc),5)]<-NA
titles.sc
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 6 11 16 21
# [2,] 2 7 12 17 NA
# [3,] 3 NA 13 18 23
# [4,] 4 9 14 NA 24
# [5,] 5 NA 15 NA 25
kmeans throws your error:
kmeans(titles.sc, 2)
# Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
because your column subsetting is probably not doing what you'd expect:
colSums(titles.sc) > 0
# [1] TRUE NA TRUE NA NA
colSums produces NA if missing values are not removed (check the help file under ?colSums). Among other things, you could do
colSums(is.na(titles.sc)) == 0
# [1] TRUE FALSE TRUE FALSE FALSE
or
!is.na(colSums(titles.sc) > 0)
# [1] TRUE FALSE TRUE FALSE FALSE
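Since the error message mentions NA, NaN and Inf alike, a slightly broader check may also be worth trying. The following is only a sketch under stated assumptions: it rebuilds the example matrix with hard-coded NA positions (so it does not depend on what sample() returns in a given R version), and the name keep is introduced here purely for illustration. It uses is.finite() to flag every column that holds at least one non-finite entry.

```r
# Rebuild the 5x5 example matrix; the NA positions are hard-coded
# (an assumption made here so the result is reproducible).
titles.sc <- matrix(as.numeric(1:25), 5, 5)
titles.sc[c(8, 10, 19, 20, 22)] <- NA

# is.finite() is FALSE for NA, NaN, Inf and -Inf, so this keeps only
# columns in which every entry is an ordinary number.
keep <- colSums(!is.finite(titles.sc)) == 0

keep          # logical mask over columns
which(!keep)  # indices of the columns that would be dropped
```

On the example data this selects the same columns as the is.na() version, but it would also catch infinite values if any were present.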
And now it works:
titles.sf <- titles.sc[, colSums(is.na(titles.sc)) == 0, drop = FALSE]
kmeans(titles.sf, 2)
# K-means clustering with 2 clusters of sizes 2, 3
# ...
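As for where the missing values in the original data might come from: one possibility, and it is only a guess since the real matrix isn't shown, is the scale() call itself. scale() divides each column by its standard deviation, so a column with zero variance becomes 0/0 = NaN. The toy matrix m and the names m.sc, m.ok and km below are made up for illustration.

```r
# Hypothetical toy data: column b is constant, so its standard deviation is 0.
m <- cbind(a = c(1, 2, 3), b = c(5, 5, 5), c = c(0, 4, 8))

# Centering b gives zeros; dividing by sd 0 then yields NaN in every row.
m.sc <- scale(m)
colSums(is.na(m.sc))  # counts NA/NaN entries per column

# Drop every column containing a non-finite value before clustering.
m.ok <- m.sc[, colSums(!is.finite(m.sc)) == 0, drop = FALSE]

set.seed(1)           # kmeans picks random starting centers
km <- kmeans(m.ok, 2)
km$cluster
```

If this is what happens with the real TDM, filtering with the same mask after scale() (rather than calling na.omit() before it) should leave a matrix that kmeans accepts.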