R cluster analysis and dendrogram with correlation matrix
I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I computed a correlation matrix from the pairwise-complete observations:
corloads = cor(df1[,2:185], use = "pairwise.complete.obs")
Now I don't know how to go on. I have read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are appropriate for my data?
I already tried this:
dissimilarity = 1 - corloads
distance = as.dist(dissimilarity)
plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="")
I got a plot, but it's very messy and I don't know how to read it or how to go on. It looks like this: [dendrogram image]
Any idea how to improve it? And what can I actually get out of it?
I also wanted to create a scree plot. I read that it shows a curve from which you can see how many clusters are appropriate.
I also performed a cluster analysis and chose 2-20 clusters, but the results are so long that I have no idea how to handle them or which things are important to look at.
Several methods are available for determining the "optimal number of clusters", although the topic is controversial.
The kgs function is helpful for finding the optimal number of clusters.
Following your code, one would do:
library(maptree)  # kgs() comes from the maptree package
clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclust = 20)  # the argument is spelled maxclust
plot(names(op_k), op_k, xlab = "# clusters", ylab = "penalty")
So the optimal number of clusters according to the kgs function is the one at which op_k reaches its minimum, as you can see in the plot. You can get the minimum penalty with
min(op_k)
Note that I set the maximum number of clusters allowed to 20. You can set this argument to NULL.
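Once a number of clusters has been chosen, the tree can be cut to get the actual group membership, which is usually easier to read than the full dendrogram. The following is a minimal sketch using only base R (cutree and rect.hclust); the synthetic data and the choice k = 3 are assumptions made for illustration and stand in for the question's df1 and for whatever value the penalty plot suggests.

```r
# Synthetic stand-in for the question's data (assumption): 12 variables
# built from 3 underlying signals, so that 3 groups of variables exist.
set.seed(42)
base <- matrix(rnorm(100 * 3), ncol = 3)
x <- cbind(base[, rep(1, 4)], base[, rep(2, 4)], base[, rep(3, 4)]) +
  matrix(rnorm(100 * 12, sd = 0.3), ncol = 12)
colnames(x) <- paste0("v", 1:12)

# Same steps as in the question: correlation -> dissimilarity -> hclust
distance <- as.dist(1 - cor(x, use = "pairwise.complete.obs"))
clus <- hclust(distance)

k <- 3                          # hypothetical choice of cluster count
groups <- cutree(clus, k = k)   # named integer vector: variable -> cluster id

table(groups)                   # cluster sizes
split(names(groups), groups)    # which variables fall in which cluster

# Draw the dendrogram and box the k clusters
plot(clus, main = "Dissimilarity = 1 - Correlation", xlab = "")
rect.hclust(clus, k = k, border = "red")
```

With 184 variables, as in the question, the split() output is typically the more useful result: it lists the variables in each cluster without having to read labels off the plot.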
Check this page for more methods.
Hope it helps you.
To find the optimal number of clusters, you can do
op_k[which(op_k == min(op_k))]
Also see this post for a nice graphical answer from @Ben.
op_k[which(op_k == min(op_k))] still gives the penalty. To find the optimal number of clusters, use
as.integer(names(op_k[which(op_k == min(op_k))]))
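A slightly shorter way to write the same lookup is base R's which.min, which returns the position of the first minimum and keeps its name. The sketch below uses made-up penalty values (an assumption, not real kgs output) in the same shape as op_k: a numeric vector named by the number of clusters.

```r
# Made-up penalties standing in for the result of kgs(); the names are
# the candidate numbers of clusters.
op_k <- c("2" = 0.91, "3" = 0.62, "4" = 0.48, "5" = 0.55, "6" = 0.70)

# which.min() returns the index of the smallest element, carrying its name
best_k <- as.integer(names(which.min(op_k)))
best_k

# Same result as the longer expression above when the minimum is unique
long_form <- as.integer(names(op_k[which(op_k == min(op_k))]))
```

The two forms differ only when several cluster counts tie for the minimum penalty: which.min keeps the first one, while the which(... == min(...)) form returns all of them.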
I'm happy to learn about the kgs function. Another option is using the find_k function from the dendextend package (it uses the average silhouette width). But given the kgs function, I might just add it as another option to the package. Also note the dendextend::color_branches function, to color your dendrogram with the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches )
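As a rough sketch of how the two dendextend functions mentioned above fit together: find_k suggests a number of clusters, and color_branches colors the tree accordingly. The synthetic data and the krange of 2 to 6 are assumptions for illustration; find_k relies on the fpc package being installed.

```r
library(dendextend)  # find_k() additionally needs the fpc package

# Synthetic stand-in for the question's data (assumption): 12 variables
# built from 3 underlying signals.
set.seed(42)
base <- matrix(rnorm(100 * 3), ncol = 3)
x <- cbind(base[, rep(1, 4)], base[, rep(2, 4)], base[, rep(3, 4)]) +
  matrix(rnorm(100 * 12, sd = 0.3), ncol = 12)
colnames(x) <- paste0("v", 1:12)

# Correlation-based dissimilarity, as in the question
distance <- as.dist(1 - cor(x, use = "pairwise.complete.obs"))
dend <- as.dendrogram(hclust(distance))

# Suggest k by average silhouette width over a candidate range
best <- find_k(dend, krange = 2:6)
best$k

# Color the branches by the suggested number of clusters and plot
dend_col <- color_branches(dend, k = best$k)
plot(dend_col, main = "Dissimilarity = 1 - Correlation")
```

If a different criterion such as kgs is preferred, its result can be passed to color_branches through the same k argument.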