R cluster analysis and dendrogram with a correlation matrix

Question

I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I computed a correlation matrix.

corloads = cor(df1[,2:185], use = "pairwise.complete.obs")

Now I don't know how to go on. I have read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are appropriate for my data?

I already tried this:

dissimilarity = 1 - corloads
distance = as.dist(dissimilarity) 

plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="")

I got a plot, but it's very messy and I don't know how to read it or how to go on. It looks like this:

[image: dendrogram from the code above]

Any idea how to improve it? And what can I actually get out of it?

I also wanted to create a scree plot. I read that it shows a curve from which you can read off how many clusters are appropriate.

I also ran a cluster analysis for 2 to 20 clusters, but the output is so long that I have no idea how to handle it or which parts are important to look at.

Answer 1

Several methods are available to determine the "optimal number of clusters", although the topic is controversial.

The kgs function (from the maptree package) is helpful for finding the optimal number of clusters.

Following on from your code, one would do:

library(maptree)  # provides kgs()

clus <- hclust(distance)
# penalty for each tree size from 2 up to maxclust clusters
op_k <- kgs(clus, distance, maxclust = 20)
plot(names(op_k), op_k, xlab = "# clusters", ylab = "penalty")

So the optimal number of clusters according to the kgs function is the one with the minimum value of op_k, as you can see in the plot. You can get that minimum penalty with

min(op_k)

Note that I set the maximum number of clusters allowed to 20. You can also set this argument to NULL.

Check this page for more methods.

Hope it helps you.

Edit

To find which number of clusters is optimal, you can do
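One further method, mentioned in the question rather than in this answer, is a scree-style (elbow) curve. A minimal base R sketch, assuming synthetic data (the hypothetical `toy` matrix) in place of `df1[, 2:185]`, plots the merge heights of the tree against the number of clusters; a sharp drop suggests a place to cut. This is an illustration, not a definitive criterion.

```r
# Synthetic stand-in for the question's data: 12 variables in 3 correlated groups
set.seed(1)
base <- matrix(rnorm(60 * 3), ncol = 3)
toy  <- base[, rep(1:3, each = 4)] + matrix(rnorm(60 * 12, sd = 0.3), ncol = 12)
colnames(toy) <- paste0("v", 1:12)

clus <- hclust(as.dist(1 - cor(toy, use = "pairwise.complete.obs")))

# clus$height is sorted ascending; reversed, element i is the height
# at which (i + 1) clusters are merged into i clusters
merge_height <- rev(clus$height)
n_clusters   <- seq_along(merge_height) + 1

plot(n_clusters, merge_height, type = "b",
     xlab = "# clusters before the merge", ylab = "merge height")
```

With real data the elbow is often less clear than in a toy example, so it is worth comparing it with the kgs penalty.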

op_k[which(op_k == min(op_k))]

Plus

Also see this post for a nicely illustrated answer from @Ben.

Edit

op_k[which(op_k == min(op_k))]

still returns the penalty value (named by its cluster count), not the count itself. To get the optimal number of clusters as an integer, use

as.integer(names(op_k[which(op_k == min(op_k))]))
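Once a number of clusters has been chosen, base R's cutree turns the tree into cluster memberships. The sketch below is only an illustration: `op_k` here is a hypothetical penalty vector shaped like kgs output, `toy` is synthetic data, and `which.min` is used as a shorter equivalent of the expression above when the minimum is unique.

```r
# Hypothetical penalty vector shaped like kgs() output: names are cluster counts
op_k <- c("2" = 5.1, "3" = 3.4, "4" = 2.9, "5" = 3.8)

# which.min() keeps the name of the smallest element
best_k <- as.integer(names(which.min(op_k)))

# Synthetic stand-in for the question's data: 12 variables in 4 correlated groups
set.seed(42)
base <- matrix(rnorm(50 * 4), ncol = 4)
toy  <- base[, rep(1:4, each = 3)] + matrix(rnorm(50 * 12, sd = 0.3), ncol = 12)
colnames(toy) <- paste0("v", 1:12)

clus <- hclust(as.dist(1 - cor(toy)))

# Named integer vector: variable name -> cluster id
groups <- cutree(clus, k = best_k)

table(groups)                  # cluster sizes
split(names(groups), groups)   # variables listed per cluster
```

The per-cluster listing is usually easier to read than the full dendrogram when there are many variables.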

Answer 2

I'm happy to learn about the kgs function. Another option is the find_k function from the dendextend package (it uses the average silhouette width). Given the kgs function, I might just add it as another option to the package. Also note the dendextend::color_branches function, which colors your dendrogram according to the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches ).
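A rough sketch of how the two dendextend functions could be combined, assuming dendextend and fpc (which find_k relies on) are installed; the `toy` matrix and the `krange` of 2 to 6 are illustrative assumptions, not values from the question.

```r
library(dendextend)  # find_k() also requires the fpc package to be installed

# Synthetic stand-in for the question's data: 12 variables in 3 correlated groups
set.seed(1)
base <- matrix(rnorm(60 * 3), ncol = 3)
toy  <- base[, rep(1:3, each = 4)] + matrix(rnorm(60 * 12, sd = 0.3), ncol = 12)
colnames(toy) <- paste0("v", 1:12)

clus <- hclust(as.dist(1 - cor(toy)))
dend <- as.dendrogram(clus)

# Estimate k by average silhouette width over the candidate range
k_est <- find_k(dend, krange = 2:6)
k_est$k

# Color the branches by the chosen number of clusters and plot
dend_col <- color_branches(dend, k = k_est$k)
plot(dend_col, main = "Dissimilarity = 1 - Correlation")
```

Coloring the branches tends to make a crowded dendrogram considerably easier to read.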

Answer 1: score 3, accepted, 2017-12-12 16:17:54
Answer 2: score 1, 2017-12-16 12:58:37
