Why is k-means clustering the observations rather than the variables in R?
I have a dataset mydata with 84 variables, each with 300 observations, as shown below:
I am using the following code to cluster mydata into 5 clusters:
mydata <- read.csv("mydata.csv", header = TRUE)
# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)
# Cluster Plot against first 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
This produces the following plot:
I expected it to plot the 84 variables with their names, as shown in the first image, based on the observations. But instead, as can be seen in the last image, it is clustering the 300 observations. How can I fix this?
I tried transposing mydata, but that doesn't solve the issue.
EDIT: I expected it to plot something like this (but this plot is for another dataset). I include it only because it shows the names of the variables on the plot, which means the variables are being plotted (based on the observations).
If you want to cluster variables, not instances, you can simply transpose your data matrix.
Usually, clustering is applied to data points (rows), not columns.
Beware of the usual limitations of k-means. It is very sensitive to the scale of the data.
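The transpose-and-scale idea above can be sketched roughly as follows. This is only an illustration under assumptions: the data frame is synthetic (random numbers standing in for mydata.csv), the names var1 to var84 are made up, standardizing each variable before transposing is one possible way to deal with the scale sensitivity, and prcomp is swapped in for clusplot because it does not require more rows than columns.

```r
# Synthetic stand-in for mydata: 300 observations (rows) x 84 variables (columns).
# The values and the names var1..var84 are hypothetical.
set.seed(42)
mydata <- as.data.frame(matrix(rnorm(300 * 84), nrow = 300, ncol = 84))
names(mydata) <- paste0("var", seq_len(84))

# scale() standardizes each column (variable); t() then turns every variable
# into a row, so kmeans() treats the 84 variables as the points to cluster.
vars_as_rows <- t(scale(mydata))

fit <- kmeans(vars_as_rows, centers = 5, nstart = 10)

# One cluster label per variable, named after the variable.
head(fit$cluster)

# Plot the variables on their first two principal components,
# labelled by name and coloured by cluster.
pca <- prcomp(vars_as_rows)
plot(pca$x[, 1:2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1:2], labels = rownames(vars_as_rows), col = fit$cluster)
```

With real data, replace the synthetic block with the read.csv call from the question and check that all columns are numeric before scaling.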
The plot you mention was probably created using the mtcars dataset:
print(datasets::mtcars)
The points you see are clearly observations.
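A quick way to check this, assuming the plot really was made from mtcars, is to look at what the row names are and what k-means labels. The choice of 3 clusters and the scaling step below are assumptions for illustration, not something taken from the original plot.

```r
# mtcars ships with R (datasets package): rows are cars, columns are measurements.
dim(mtcars)             # 32 observations, 11 variables
head(rownames(mtcars))  # car model names, i.e. labels of observations
names(mtcars)           # variable names such as mpg, cyl, disp

# k-means assigns one cluster label per row, that is, per car.
set.seed(123)
fit_cars <- kmeans(scale(mtcars), centers = 3, nstart = 10)
head(fit_cars$cluster)

# With labels = 2, clusplot() labels the points; the labels are the row names.
if (requireNamespace("cluster", quietly = TRUE)) {
  cluster::clusplot(mtcars, fit_cars$cluster, color = TRUE, shade = TRUE,
                    labels = 2, lines = 0)
}
```

The names on such a plot are car models (rows), which is why they can look like variable names at first glance.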
If you want to create a cluster of variables, multiple options exist:
- Compute the correlations between the variables, for example with cor, and do a hierarchical clustering with hclust.
- Transpose the data and use kmeans to cluster your variables.
Also, the question of why you want to cluster the variables probably needs some more thought.
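The cor plus hclust option could look roughly like the sketch below. It makes several assumptions: the data are synthetic (random numbers in place of mydata), 1 - correlation is used as the dissimilarity (1 - abs(cor) is a common alternative when negatively correlated variables should be grouped together), and average linkage with a cut at 5 groups is an arbitrary choice.

```r
# Synthetic stand-in for mydata: 300 observations x 84 variables (hypothetical names).
set.seed(1)
mydata <- as.data.frame(matrix(rnorm(300 * 84), nrow = 300, ncol = 84))
names(mydata) <- paste0("var", seq_len(84))

# cor() works column-wise, so this is an 84 x 84 matrix of correlations
# between variables; no transpose is needed.
cor_mat <- cor(mydata)

# Turn correlation into a dissimilarity: highly correlated variables are close.
var_dist <- as.dist(1 - cor_mat)

# Hierarchical clustering of the variables, then cut the tree into 5 groups.
hc <- hclust(var_dist, method = "average")
groups <- cutree(hc, k = 5)

# The dendrogram leaves are labelled with the variable names.
plot(hc, cex = 0.6)
table(groups)
```

Because the dendrogram is labelled by variable name, this gives a plot of the variables themselves rather than of the 300 observations.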