R cluster analysis and dendrogram with correlation matrix

I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I computed a correlation matrix.

corloads = cor(df1[,2:185], use = "pairwise.complete.obs")

Now I don't know how to proceed. I have read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are appropriate for my data?

I have already tried this:

dissimilarity = 1 - corloads
distance = as.dist(dissimilarity) 

plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="") 

I got a plot, but it's very messy and I don't know how to read it or how to proceed. It looks like this:

[image: dendrogram]

Any idea how to improve it? And what can I actually get out of it?

I also wanted to create a scree plot. I read that it shows a curve from which you can see how many clusters are appropriate.

I also ran a cluster analysis for 2 to 20 clusters, but the output is so long that I have no idea how to handle it or which parts are important to look at.

Several methods are available for determining the "optimal number of clusters", although the topic is controversial.

The kgs function from the maptree package is helpful for finding the optimal number of clusters.

Following your code, one would do:

library(maptree)  # provides kgs()

clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclust = 20)  # penalty for each cluster count up to 20
plot(as.integer(names(op_k)), op_k, xlab = "# clusters", ylab = "penalty")

So the optimal number of clusters according to the kgs function is the one with the minimum value of op_k, as you can see in the plot. You can get that minimum penalty with

min(op_k)

Note that I set the maximum number of clusters allowed to 20. You can also set this argument to NULL.

Check this page for more methods.
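The scree plot mentioned in the question has a rough analogue for hierarchical clustering: plotting the merge heights stored in the hclust object against the number of clusters. The following is only a minimal sketch of that idea. It uses the built-in mtcars data as a stand-in for df1 (an assumption for illustration) and relies on the default complete linkage, for which merge heights never decrease.

```r
# Stand-in data (assumption): variables of the built-in mtcars data set instead of df1.
distance <- as.dist(1 - cor(mtcars, use = "pairwise.complete.obs"))
clus <- hclust(distance)   # default "complete" linkage

# clus$height lists the merge heights in merge order. Reversed, element i is the
# height of the merge that reduces i + 1 clusters to i.
heights <- rev(clus$height)
k <- seq_along(heights) + 1   # cluster count just before each of those merges

plot(k, heights, type = "b", xlab = "# clusters", ylab = "merge height")
```

A pronounced bend in this curve is usually read as the point beyond which further splitting gains little. It is a visual heuristic, so it is worth comparing with the kgs penalty rather than relying on it alone.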

Hope this helps.

Edit

To find which number of clusters is optimal, you can do

op_k[which(op_k == min(op_k))]

Plus

Also see this post for an excellent graphical answer from @Ben.

Edit

op_k[which(op_k == min(op_k))]

still returns the penalty value. To get the optimal number of clusters, use

as.integer(names(op_k[which(op_k == min(op_k))]))
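As a possible shortcut, base R's which.min returns the position of the smallest element together with its name, so the same lookup can be written more compactly. The sketch below uses a made-up penalty vector shaped like the kgs output (names are cluster counts) and the built-in mtcars data as a stand-in for df1. Both are assumptions for illustration only. It then passes the chosen number to cutree to see which variables end up in which cluster.

```r
# Hypothetical penalty vector shaped like kgs() output: names are cluster counts.
op_k <- c("2" = 5.1, "3" = 3.4, "4" = 3.9, "5" = 4.6)
best_k <- as.integer(names(which.min(op_k)))   # name of the smallest penalty

# Stand-in data (assumption): cluster the variables of the built-in mtcars data set.
distance <- as.dist(1 - cor(mtcars, use = "pairwise.complete.obs"))
clus <- hclust(distance)

membership <- cutree(clus, k = best_k)   # named vector: variable -> cluster id
split(names(membership), membership)     # variables grouped by cluster
```

Note that which.min keeps only the first minimum when there are ties, whereas the which(op_k == min(op_k)) form returns all of them. Because the distance is built from a correlation matrix, the objects being clustered are the variables (columns), not the observations.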

I'm happy to learn about the kgs function. Another option is the find_k function from the dendextend package (it uses the average silhouette width). But given the kgs function, I might just add it as another option to the package. Also note the dendextend::color_branches function, which colors your dendrogram according to the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches).
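To illustrate the average silhouette width criterion itself, here is a minimal sketch that applies it to cuts of an hclust tree, using silhouette from the cluster package. This is not dendextend's implementation of find_k. The candidate range 2:6 and the built-in mtcars data (standing in for df1) are assumptions for illustration, and base R's rect.hclust is used as a simple substitute for color_branches.

```r
library(cluster)   # recommended package shipped with R; provides silhouette()

# Stand-in data (assumption): variables of the built-in mtcars data set instead of df1.
distance <- as.dist(1 - cor(mtcars, use = "pairwise.complete.obs"))
clus <- hclust(distance)

ks <- 2:6   # assumed range of candidate cluster counts
avg_width <- sapply(ks, function(k) {
  sil <- silhouette(cutree(clus, k = k), distance)
  mean(sil[, "sil_width"])   # average silhouette width for this cut
})
names(avg_width) <- ks
best_k <- ks[which.max(avg_width)]   # larger average width = better separated clusters

plot(clus, main = "Dissimilarity = 1 - Correlation", xlab = "")
rect.hclust(clus, k = best_k, border = "red")   # draw a box around each chosen cluster
```

Marking the clusters on the plot this way also makes a crowded dendrogram easier to read, since each box shows directly which leaves belong together.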
