
How to remove NA columns from a TDM for clustering

I'm struggling with NA values in a TDM (term-document matrix) when trying to run clustering. Initially I set:

titles.tdm <- as.matrix(TermDocumentMatrix(titles.cw, control = list(bounds = list(global = c(10,Inf)))))

titles.sc <- scale(na.omit(titles.tdm))

and got a matrix of 418 terms and 6955 documents. At this point, executing titles.km <- kmeans(titles.sc, 2) throws Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1).

When I tried to remove those values with:

titles.sf <- titles.sc[,colSums(titles.sc) > 0]

I got a matrix of 4695 documents, but applying the kmeans function still throws this error. When I inspect the titles.sf variable, there are still columns (docs) with NA values. I'm confused and don't know what I'm doing wrong. How can I remove those documents?

Earlier, I applied titles.cw <- titles.cc[which(str_trim(titles.cc$content) != "")], where titles.cc is a plain Corpus object from the tm library, to delete blank documents. It probably worked, but my NA values are in documents that are definitely not blank.

Here's some example data:

set.seed(123)
titles.sc <- matrix(1:25,5,5)
titles.sc[sample(length(titles.sc),5)]<-NA 
titles.sc
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    6   11   16   21
# [2,]    2    7   12   17   NA
# [3,]    3   NA   13   18   23
# [4,]    4    9   14   NA   24
# [5,]    5   NA   15   NA   25

kmeans throws your error

kmeans(titles.sc, 2)
# Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

because your column subsetting is probably not doing what you'd expect:

colSums(titles.sc) > 0
# [1] TRUE   NA TRUE   NA   NA

colSums returns NA for any column that contains a missing value, unless missing values are removed first (see the help page ?colSums). Comparing NA with 0 gives NA again, so the test does not give a clean TRUE/FALSE mask. Among other things, you could do

colSums(is.na(titles.sc)) == 0
# [1]  TRUE FALSE  TRUE FALSE FALSE

or

!is.na(colSums(titles.sc) > 0)
# [1]  TRUE FALSE  TRUE FALSE FALSE
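As a hedged illustration of the point about colSums, the sketch below uses a small hypothetical matrix m (not taken from the question) to compare colSums with and without na.rm. It then builds the same kind of keep-mask with anyNA, which is just another way of writing the checks above.

```r
# Hypothetical 3 x 2 matrix: only the second column has a missing value.
m <- matrix(c(1, 2, 3, NA, 5, 6), nrow = 3)

# By default NA propagates through the sum, so the second entry is NA.
sums.default <- colSums(m)

# With na.rm = TRUE the missing values are skipped before summing.
sums.narm <- colSums(m, na.rm = TRUE)

# TRUE for every column that contains no NA at all.
keep <- !apply(m, 2, anyNA)

# drop = FALSE keeps the result a matrix even if a single column survives.
m.clean <- m[, keep, drop = FALSE]
```

Note that colSums(m, na.rm = TRUE) > 0 would not serve as a filter here: it only hides the NA values from the sum, and the columns that contain them would still be kept.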

And now it works:

titles.sf <- titles.sc[, colSums(is.na(titles.sc)) == 0, drop = FALSE]
kmeans(titles.sf, 2)
# K-means clustering with 2 clusters of sizes 2, 3
# ...
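One plausible source of the NA values in the question, offered as an assumption rather than a diagnosis of the asker's data, is scale() itself. A column with zero variance (for example, a document containing none of the retained terms) is divided by a standard deviation of 0 and becomes NaN. Calling na.omit before scale cannot catch this, because the values do not exist yet at that point. The sketch below uses a hypothetical toy matrix tdm and drops constant columns before scaling.

```r
# Hypothetical stand-in for a term-document matrix (terms in rows,
# documents in columns); the second column is constant (all zeros).
tdm <- matrix(c(1, 0, 2,
                0, 0, 0,
                3, 1, 0), nrow = 3)

# Scaling a constant column computes 0 / 0, which yields NaN.
scaled <- scale(tdm)
na.free <- colSums(is.na(scaled)) == 0   # FALSE for the constant column

# Keep only columns with non-zero standard deviation, then scale.
has.variance <- apply(tdm, 2, sd) > 0
clean <- scale(tdm[, has.variance, drop = FALSE])

# With no NA/NaN left, kmeans runs without the foreign function call error.
fit <- kmeans(clean, centers = 2)
```

Filtering before scaling or after scaling (as in the answer above) removes the same columns in this toy case. The difference is only whether the NaN values are ever created.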


 