
Clustering Variables

What are some proven methods, easily implemented in R, for finding groupings of highly correlated variables within a large, high-dimensional binary dataset (think 200,000+ rows and 150+ fields)? I want to find groupings of variables that lend themselves to interpretation, so I don't think PCA would be the best method.

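    library(Hmisc)
    # Columns 2 to 8 of mtcars (cyl through vs), as a numeric matrix
    mtc <- mtcars[, 2:8]
    mtcn <- data.matrix(mtc)
    # Hierarchical clustering of the variables (columns), not the rows
    clust <- varclus(mtcn)
    clust
    plot(clust)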

From ?varclus: Does a hierarchical cluster analysis on variables, using the Hoeffding D statistic, squared Pearson or Spearman correlations, or proportion of observations for which two variables are both positive as similarity measures. Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction.
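To make the idea concrete for 0/1 data, here is a minimal base-R sketch of clustering variables on squared Pearson correlation. It is an approximation of the approach described in ?varclus, not the function's own code; the simulated data, the `noisy` helper, and the choice of `k = 2` are all illustrative assumptions.

```r
# Sketch only: the simulated data and all object names are assumptions.
set.seed(1)
n <- 2000

# Two hidden binary factors, each driving a block of three observed variables
f1 <- rbinom(n, 1, 0.5)
f2 <- rbinom(n, 1, 0.5)

# Hypothetical helper: copy a factor, flipping each value with probability p
noisy <- function(f, p = 0.1) ifelse(runif(length(f)) < p, 1 - f, f)

x <- cbind(a1 = noisy(f1), a2 = noisy(f1), a3 = noisy(f1),
           b1 = noisy(f2), b2 = noisy(f2), b3 = noisy(f2))

# Squared Pearson correlation as similarity (the phi coefficient for 0/1 data)
sim <- cor(x)^2

# Convert similarity to dissimilarity and cluster the variables
hc <- hclust(as.dist(1 - sim), method = "complete")

# Cut the tree into k groups; k = 2 is assumed known for this toy example
groups <- cutree(hc, k = 2)
split(names(groups), groups)
```

On real data the number of groups is not known in advance, so inspecting `plot(hc)` and cutting at a chosen height (`cutree(hc, h = ...)`) may be more natural. With 150 fields the correlation matrix is only 150 x 150, so the row count mainly affects the cost of `cor()`.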

For binary variables:

    library(cluster)
    data(animals)
    ma <- mona(animals)
    ma
    plot(ma)

From ?mona: Returns a list representing a divisive hierarchical clustering of a dataset with binary variables only.
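The returned list can be inspected directly. The sketch below pulls out the components that, as far as I recall from the cluster documentation, record the ordering of observations and the variable used at each split. Note that mona divides the observations (rows) using one variable per step, so the variables it picks are a by-product of the row clustering, not a clustering of the variables themselves.

```r
library(cluster)   # a recommended package, shipped with most R installations
data(animals)
ma <- mona(animals)

# Observation labels in the order used by the banner plot
ma$order.lab

# For each adjacent pair in that order: the variable that separated them
ma$variable

# ... and the separation step at which that split happened
ma$step
```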

