R cluster analysis and dendrogram with correlation matrix

Question

I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I computed a correlation matrix using pairwise complete observations.

corloads = cor(df1[,2:185], use = "pairwise.complete.obs")

Now I am not sure how to proceed. I have read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are appropriate for my data?

I already tried this:

dissimilarity = 1 - corloads
distance = as.dist(dissimilarity) 

plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="")

I got a plot, but it's very messy and I don't know how to read it or how to go on. It looks like this:

Any idea how to improve it? And what can I actually get out of it?

I also wanted to create a scree plot. I read that its curve shows how many clusters are appropriate.

I also ran a cluster analysis for 2 to 20 clusters, but the output is so long that I have no idea how to handle it or which parts are important to look at.

Answer 1

Several methods are available to determine the "optimal number of clusters", although it is a controversial topic.

The kgs function (from the maptree package) is helpful for finding the optimal number of clusters.

Following your code one would do:

library(maptree)  # provides kgs()

clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclust = 20)
plot(as.integer(names(op_k)), op_k, xlab = "# clusters", ylab = "penalty")

According to the kgs function, the optimal number of clusters is the one at which the penalty op_k is smallest, as you can see in the plot. You can get the minimum penalty with

min(op_k)

Note that I set the maximum number of clusters to 20. You can also set this argument to NULL.

Check this page for more methods.
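Since the question also asks about a scree plot, here is a minimal sketch of one common way to draw an elbow-style curve from an hclust result, using only base R. It is an illustration rather than a definitive method: the built-in mtcars data stands in for df1[, 2:185], and the names merge_height and k are made up for this example.

```r
# Stand-in data (an assumption): the built-in mtcars data set replaces
# df1[, 2:185] from the question.
corloads <- cor(mtcars, use = "pairwise.complete.obs")
distance <- as.dist(1 - corloads)
clus <- hclust(distance)

# hclust() records one merge height per merge (n - 1 merges for n variables).
# Reversed, the (k - 1)-th value is the height at which k clusters
# are merged into k - 1.
n_merge <- length(clus$height)
k <- 2:min(20, n_merge + 1)
merge_height <- rev(clus$height)[k - 1]

# Look for an "elbow": the point after which adding clusters
# no longer lowers the merge height by much.
plot(k, merge_height, type = "b",
     xlab = "# clusters", ylab = "merge height")
```

Reading the elbow off such a curve is a judgment call, so it is worth comparing it with the kgs penalty plot rather than relying on either alone.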

Hope it helps you.

Edit

To find the optimal number of clusters, you can do

op_k[which(op_k == min(op_k))]

Plus

Also see this post for an excellent graphical answer from @Ben.

Edit

op_k[which(op_k == min(op_k))]

still returns the penalty value. To get the optimal number of clusters itself, use

as.integer(names(op_k[which(op_k == min(op_k))]))
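Once a number of clusters has been chosen, the dendrogram can be cut to get the actual group memberships. The following is a rough sketch using base R only. The mtcars data is a stand-in for df1[, 2:185], and best_k <- 3 is an arbitrary placeholder, not a value derived from your data.

```r
# Stand-in data (an assumption): mtcars replaces df1[, 2:185].
corloads <- cor(mtcars, use = "pairwise.complete.obs")
distance <- as.dist(1 - corloads)
clus <- hclust(distance)

# Placeholder value (an assumption); use the number of clusters
# obtained from the penalty curve instead.
best_k <- 3

# Named integer vector: one cluster id per variable.
groups <- cutree(clus, k = best_k)

table(groups)                  # cluster sizes
split(names(groups), groups)   # variable names in each cluster

# Draw the dendrogram and outline the chosen clusters.
plot(clus, main = "Dissimilarity = 1 - Correlation", xlab = "")
rect.hclust(clus, k = best_k, border = "red")
```

The list returned by split() is usually easier to read than the full dendrogram when there are many variables.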

Answer 2

I'm happy to learn about the kgs function. Another option is using the find_k function from the dendextend package (it uses the average silhouette width). But given the kgs function, I might just add it as another option to the package. Also note the dendextend::color_branches function, to color your dendrogram with the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches )
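To illustrate the average silhouette width idea, here is a minimal sketch that computes it directly for several cuts of the hclust tree. Note that this is not find_k itself: it uses silhouette() from the cluster package on cutree() results, with mtcars standing in for the question's data and k_range chosen arbitrarily for the example.

```r
library(cluster)  # recommended package, provides silhouette()

# Stand-in data (an assumption): mtcars replaces df1[, 2:185].
corloads <- cor(mtcars, use = "pairwise.complete.obs")
distance <- as.dist(1 - corloads)
clus <- hclust(distance)

# Candidate numbers of clusters (arbitrary range for this example;
# must stay between 2 and n - 1 for n variables).
k_range <- 2:8

# Average silhouette width for each cut of the tree.
avg_sil <- sapply(k_range, function(k) {
  sil <- silhouette(cutree(clus, k = k), distance)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- k_range

# The k with the largest average silhouette width.
best_k <- k_range[which.max(avg_sil)]

plot(k_range, avg_sil, type = "b",
     xlab = "# clusters", ylab = "average silhouette width")
```

With dendextend installed, the resulting best_k could then be passed on, for example as color_branches(as.dendrogram(clus), k = best_k), to color the dendrogram accordingly.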

2 answers

Answer 1: 3, accepted, 2017-12-12 16:17:54
Answer 2: 1, 2017-12-16 12:58:37
