
Why is k-means clustering the observations rather than the variables in R?

I have a dataset mydata with 84 variables, each with 300 observations, as shown below:

[image]

I am using the following code to group mydata into 5 clusters:

mydata <- read.csv("mydata.csv", header = TRUE)

# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)

# Cluster Plot against first 2 principal components

# vary parameters for most readable graph
library(cluster) 
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

This produces the following plot: [image]

I expected it to plot the 84 variables with their names, as shown in the first image, based on the observations. Instead, as the last image shows, it is clustering the 300 observations. How can I fix this?

I tried transposing mydata, but that doesn't solve the issue.

EDIT: I expected it to plot something like this (this plot is for another dataset). I include it only to show variable names on the plot, which means the variables themselves are being plotted (based on the observations). [image]

If you want to cluster variables, not instances, you can simply transpose your data matrix.
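As a minimal sketch of the transposition idea, the following uses the built-in mtcars data as a stand-in for mydata (an assumption, as are the choice of 3 clusters and the seed). The plot is drawn with prcomp() and base graphics because, unlike princomp(), prcomp() accepts a matrix with more columns than rows, which is the usual situation after transposing.

```r
# Sketch only: the built-in mtcars data stands in for mydata (an assumption).
set.seed(42)                         # kmeans starts from random centers

vars <- t(scale(mtcars))             # standardize each variable, then transpose:
                                     # rows are now the 11 variables

fit_vars <- kmeans(vars, centers = 3, nstart = 25)

# One cluster label per variable, named by variable
print(fit_vars$cluster)

# Project the variables onto their first two principal components
# and label each point with the variable name.
pc <- prcomp(vars)
plot(pc$x[, 1:2], col = fit_vars$cluster, pch = 19)
text(pc$x[, 1:2], labels = rownames(vars), pos = 3)
```

Each point in the resulting plot is a variable rather than an observation, which is what the question was aiming for.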

Usually, clustering is applied to data points, not columns.

Beware of the usual limitations of k-means. It is very sensitive to the scale of the variables.
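To illustrate the scale sensitivity, here is a small sketch that runs k-means on raw and on standardized data and cross-tabulates the two partitions. The mtcars data, the 3 clusters, and the seed are assumptions chosen for illustration; the two partitions can differ because large-valued columns such as disp and hp dominate Euclidean distances in the raw data.

```r
# Sketch only: mtcars stands in for mydata (an assumption).
set.seed(42)

fit_raw    <- kmeans(mtcars, centers = 3, nstart = 25)         # raw units
fit_scaled <- kmeans(scale(mtcars), centers = 3, nstart = 25)  # mean 0, sd 1 per column

# Cross-tabulate the two partitions to see how far they agree
agreement <- table(raw = fit_raw$cluster, scaled = fit_scaled$cluster)
print(agreement)
```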

The plot you mention was probably created using the mtcars dataset:

print(datasets::mtcars)

The points you see are clearly observations (the rows of mtcars, which are car models), not variables.

If you want to cluster variables, several options exist:

  • Create a matrix of distances between your variables, for example with cor, and do a hierarchical clustering with hclust.
  • Do a PCA, then cluster the projection of your variables on the resulting components. That way, you can use kmeans to cluster your variables.
  • If all your variables are numeric, you can transpose your data frame and do a k-means clustering.
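The first two options in the list can be sketched as follows. This is one possible reading, not a definitive recipe: mtcars stands in for mydata, 1 - |correlation| is one common way to turn cor output into a distance, and the choices of average linkage, 3 groups, and 2 principal components are arbitrary assumptions.

```r
# Sketch only: mtcars stands in for mydata (an assumption); k = 3 is arbitrary.

# Option 1: correlation-based distance between variables, then hclust
d  <- as.dist(1 - abs(cor(mtcars)))     # strongly correlated variables are close
hc <- hclust(d, method = "average")
groups_hc <- cutree(hc, k = 3)          # named vector: one group per variable
print(groups_hc)
plot(hc)                                # dendrogram with variable names as leaves

# Option 2: PCA, then k-means on the variable loadings
set.seed(42)
pca <- prcomp(mtcars, scale. = TRUE)
loadings <- pca$rotation[, 1:2]         # one row per variable, first two components
groups_km <- kmeans(loadings, centers = 3, nstart = 25)$cluster
print(groups_km)
```

Both results are vectors named by variable, so they can be compared directly with table(groups_hc, groups_km).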

Also, the question of why you want to cluster the variables probably needs some more thought.
