Clustering Variables
What are some proven methods for finding groupings of highly correlated variables within a large, high-dimensional binary dataset (think 200,000+ rows and 150+ fields) that can be easily implemented in R? I want to find groupings of variables that lend themselves to interpretation, so I don't think PCA would be the best method.
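library(Hmisc)
# Use a few columns of mtcars as example data
mtc <- mtcars[,2:8]
mtcn <- data.matrix(mtc)
# Hierarchical clustering of the variables (columns)
clust <- varclus(mtcn)
clust
# Dendrogram of the variable clusters
plot(clust)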
From ?varclus:
Does a hierarchical cluster analysis on variables, using the Hoeffding D statistic, squared Pearson or Spearman correlations, or proportion of observations for which two variables are both positive as similarity measures. Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction.
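To make the idea behind the description above concrete, here is a minimal base-R sketch of variable clustering with squared Pearson correlation as the similarity. It is not the varclus implementation; the simulated data, the flip helper, and the choice of k = 2 are assumptions made purely for illustration.

```r
# Simulated binary data (assumption): two hidden factors each drive three variables
set.seed(1)
n  <- 2000
f1 <- rbinom(n, 1, 0.5)
f2 <- rbinom(n, 1, 0.5)

# Hypothetical helper: flip each bit with probability p to add noise
flip <- function(x, p) ifelse(runif(length(x)) < p, 1 - x, x)

x <- cbind(a1 = flip(f1, 0.1), a2 = flip(f1, 0.1), a3 = flip(f1, 0.1),
           b1 = flip(f2, 0.1), b2 = flip(f2, 0.1), b3 = flip(f2, 0.1))

# Squared Pearson correlation between columns as the similarity measure
sim <- cor(x)^2

# Hierarchical clustering of the variables on the dissimilarity 1 - similarity
hc <- hclust(as.dist(1 - sim), method = "complete")

# Cut the tree into k groups; k = 2 is chosen here because the data were simulated that way
groups <- cutree(hc, k = 2)
split(names(groups), groups)
```

On real data the number of groups is not known in advance, so inspecting plot(hc) before choosing where to cut is usually a more sensible approach than fixing k.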
For binary variables:
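library(cluster)
# Example dataset with binary variables only
data(animals)
# Monothetic divisive hierarchical clustering
ma <- mona(animals)
ma
# Banner plot of the clustering
plot(ma)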
From ?mona:
Returns a list representing a divisive hierarchical clustering of a dataset with binary variables only.
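As a rough sketch of how the returned list might be inspected: mona groups the observations (rows) and records which binary variable was used at each split, so tabulating those variables gives a hint about which fields do the separating. The component names below (order, variable, step) are taken from the mona object documentation as I understand it; treat them as an assumption and check ?mona.object for your installed version.

```r
library(cluster)
data(animals)

ma <- mona(animals)

# Order of the observations in the banner plot
ma$order

# Variable used at each separation, and the step at which it happened
ma$variable
ma$step

# How often each variable was used to split the observations
table(ma$variable)
```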