
Hierarchical clustering and k-means

I want to run a hierarchical cluster analysis. I am aware of the hclust() function but not how to use it in practice; I'm stuck on supplying the data to the function and processing the output.

The main issue is that I would like to cluster on a given measurement.

I would also like to compare the hierarchical clustering with that produced by kmeans(). Again, I am not sure how to call this function or how to use and manipulate its output.

My data are similar to:

df <- structure(list(id = c(111, 111, 111, 112, 112, 112),
                     se = c(1, 2, 3, 1, 2, 3),
                     t1 = c(1, 2, 1, 1, 1, 3),
                     t2 = c(1, 2, 2, 1, 1, 4),
                     t3 = c(1, 0, 0, 0, 2, 1),
                     t4 = c(2, 5, 7, 7, 1, 2),
                     t5 = c(1, 0, 1, 1, 1, 1),
                     t6 = c(1, 1, 1, 1, 1, 1),
                     t7 = c(1, 1, 1, 1, 1, 1),
                     t8 = c(0, 0, 0, 0, 0, 0)),
                row.names = c(NA, 6L), class = "data.frame")

I would like to run the hierarchical cluster analysis to identify the optimum number of clusters.

How can I run clustering based on a predefined measurement - in this case, for example, clustering on measurement number 2?

For hierarchical clustering there is one essential element you have to define: the method for computing the distance between data points. Clustering is an exploratory technique, so you have to choose the number of clusters based on how the data points are spread out. I will show you how in the code below. We will compare three distance methods using your data df and the function hclust():

The first method is average linkage, which uses the mean of all pairwise distances between the points of two clusters. We will omit the first variable, as it is an id:

# Method 1
hc.average <- hclust(dist(df[, -1]), method = 'average')

The second method is complete linkage, which uses the largest of all pairwise distances between the points of two clusters:

# Method 2
hc.complete <- hclust(dist(df[, -1]), method = 'complete')

The third method is single linkage, which uses the smallest of all pairwise distances between the points of two clusters:

# Method 3
hc.single <- hclust(dist(df[, -1]), method = 'single')
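To compare the three linkages at a glance, the dendrograms can also be drawn side by side; this uses base R's par(mfrow = ...) and is a sketch of my own, not part of the original answer:

```r
# Show the three dendrograms in one row for a quick visual comparison
op <- par(mfrow = c(1, 3))
plot(hc.average,  main = 'Average linkage',  xlab = '')
plot(hc.complete, main = 'Complete linkage', xlab = '')
plot(hc.single,   main = 'Single linkage',   xlab = '')
par(op)  # restore the previous plotting layout
```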

With all models fitted, we can analyze the groups.

We can define the number of clusters based on the height at which we cut the hierarchical tree: at the largest height we get a single cluster containing the whole dataset. It is standard to choose an intermediate height.

With the average method, a height of 3 will produce four groups and a height around 4.5 will produce two groups:

plot(hc.average, xlab='')

Output: (dendrogram of hc.average)

With the complete method the results are similar, but the height scale has changed.

plot(hc.complete, xlab='')

Output: (dendrogram of hc.complete)

Finally, the single method produces a different grouping. There are three groups, and even with an intermediate choice of height you will always get that number of clusters:

plot(hc.single, xlab='')

Output: (dendrogram of hc.single)

You can use whichever method you prefer to assign clusters to your data with the cutree() function, where you pass the model object and the number of clusters. One way to assess clustering performance is to check how homogeneous the groups are; that depends on the researcher's criteria. Next is how to add the cluster labels to your data. I will choose the last model and three groups:

# Add cluster
df$Cluster <- cutree(hc.single, k = 3)

Output:

   id se t1 t2 t3 t4 t5 t6 t7 t8 Cluster
1 111  1  1  1  1  2  1  1  1  0       1
2 111  2  2  2  0  5  0  1  1  0       2
3 111  3  1  2  0  7  1  1  1  0       2
4 112  1  1  1  0  7  1  1  1  0       2
5 112  2  1  1  2  1  1  1  1  0       1
6 112  3  3  4  1  2  1  1  1  0       3
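With the labels attached, one simple homogeneity check (my own suggestion, not part of the original answer) is to look at the per-cluster means of the measurements with aggregate():

```r
# Mean of each measurement within each cluster; clusters whose means
# sit close to their members' values are more homogeneous
aggregate(df[, paste0('t', 1:8)], by = list(Cluster = df$Cluster), FUN = mean)
```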

The function cutree() also has an argument called h where, instead of the number of clusters k, you set the cutting height we discussed earlier.
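For example, on the single-linkage tree a cut at h = 3 falls between the third and fourth merge heights, so it reproduces the same three groups as k = 3 (the specific height is my own choice for this data):

```r
# Cut the tree at height 3 instead of requesting k groups
cutree(hc.single, h = 3)
```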

Regarding your question about clustering on a particular measurement, you could scale your data excluding the desired variable; that variable then keeps a different scale and has more influence on the results of your clustering.
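A sketch covering both remaining points: it scales every measurement except t2, so t2 keeps a larger influence on the distances (the "cluster on measurement 2" part), and then compares the hierarchical groups with kmeans() as asked in the question. The choice of t2, single linkage, and three centers are assumptions of mine:

```r
set.seed(1)  # kmeans uses random starts

# Keep t2 on its raw scale and standardize the other measurements,
# so t2 weighs more in the distances; constant columns are skipped
# because they cannot be scaled
meas <- c('se', paste0('t', 1:8))
vars <- setdiff(meas, 't2')
vars <- vars[sapply(df[vars], sd) > 0]
scaled <- df[meas]
scaled[vars] <- scale(scaled[vars])

# k-means with the same number of groups as the hierarchical cut,
# then cross-tabulate the two cluster assignments
km <- kmeans(scaled, centers = 3, nstart = 20)
table(hierarchical = cutree(hc.single, k = 3), kmeans = km$cluster)
```

Rows and columns of the cross-tabulation that line up one-to-one indicate that the two methods agree on the grouping.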
