Make dendrograms more readable in R

I am trying to classify 1800 observations. I ran a hierarchical clustering, plotted the result as a dendrogram, and identified three groups. The problem is the visualization: the plot is not readable, because a lot of data overlaps at the bottom. The labels are numbers, but I don't know how to make them more readable. I have tried two options, and neither worked.

Option 1:

m  <- as.matrix(dtm)

distMatrix <- dist(m, method="euclidean")

groups <- hclust(distMatrix,method="ward.D")

clustering <- cutree(groups,3)

plot(groups, hang = -100, cex = 1, labels=FALSE)
rect.hclust(groups, k=3)


Option 2:

m <- as.matrix(dtm)

distMatrix <- dist(m, method = "euclidean")

groups <- hclust(distMatrix, method = "ward.D")

fviz_dend(groups, cex = 0.8, lwd = 0.8, k = 3,
          rect = TRUE,
          k_colors = "jco",
          rect_border = "jco",
          rect_fill = TRUE,
          ggtheme = theme_gray(), labels = F)


How can I represent the dendrogram without so much overlapping data appearing at the bottom? It looks very confusing with so much data together.

Two things might help: put the y-axis on a log scale, and reduce the line thickness.

The former is easy, but changing the line thickness of an existing ggplot object is fiddly.

Below is an example from a recent analysis of mine. I didn't use the fviz_dend function; instead I used as.dendrogram followed by ggplot().

If you want to work with your existing fviz plot, you could change the line thickness using the same method.
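As a rough sketch of that idea, the helper below rebuilds any ggplot object and overwrites the width of every segment layer. The name thin_segments is hypothetical (it is not part of ggplot2 or factoextra), and the small segment plot is only a stand-in; whether the segments in an fviz_dend plot respond the same way is an assumption to check on your own plot.

```r
library(ggplot2)

# Hypothetical helper: build the plot, then overwrite the line width of
# every layer that looks like a segment layer (i.e. has an `xend` column).
thin_segments <- function(p, width = 0.2) {
  built <- ggplot_build(p)
  for (i in seq_along(built$data)) {
    layer <- built$data[[i]]
    if (!"xend" %in% names(layer)) next   # skip non-segment layers
    # newer ggplot2 versions store the width as `linewidth`, older as `size`
    for (col in intersect(c("linewidth", "size"), names(layer))) {
      built$data[[i]][[col]] <- width
    }
  }
  ggplot_gtable(built)
}

# Stand-in plot made of vertical segments (not a real dendrogram)
segs <- data.frame(x = 1:5, xend = 1:5, y = 0, yend = c(2, 4, 1, 3, 5))
p <- ggplot(segs) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend))

g <- thin_segments(p)
grid::grid.draw(g)
```

In practice you would pass the object returned by fviz_dend() instead of p. Since fviz_dend already has an lwd argument, lowering it in the original call is probably the simpler first thing to try.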

Also, with a large number of leaves, you might as well hide the labels (i.e. expand = c(0, 0) in scale_y_continuous()).


Calculate the hierarchical clustering:

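library(RColorBrewer)
library(dendextend)   # color_branches(), ggplot() method for dendrograms, %>%
library(ggplot2)
n <- 4  # number of clusters
# `data` is the numeric matrix or data frame to cluster;
# Minkowski distance with p = 2 is the Euclidean distance
hdata <- hclust(dist(data, "minkowski", p = 2), method = "ward.D")
clusters <- cutree(hdata, k = n)
# vector of up to 16 different colours
col_vector <- c(brewer.pal(n = 10, "Paired"), brewer.pal(n = 6, "Set2"))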

Plot before:

# theme.text is a theme object from my own session and is not defined here;
# drop it or substitute your own theme() settings
hdata %>%
  as.dendrogram %>%
  color_branches(k = n, col = col_vector) %>%
  ggplot() + theme_classic() + theme.text +
  theme(panel.grid.major.y = element_line(), axis.title = element_blank(),
        axis.title.y = element_blank(), axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0.001, 0.001)) +
  labs(y = "")


Plot after:

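# theme.text is a theme object from my own session and is not defined here;
# drop it or substitute your own theme() settings
b <- hdata %>%
  as.dendrogram %>%
  color_branches(k = n, col = col_vector) %>%
  ggplot() + theme_classic() + theme.text +
  theme(panel.grid.major.y = element_line(), axis.title = element_blank(),
        axis.title.y = element_blank(), axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  scale_y_log10() +
  scale_x_continuous(expand = c(0.001, 0.001)) +
  labs(y = "")
# Adjust the line thickness: build the plot, overwrite the segment width,
# then convert to a gtable for drawing
# (on newer ggplot2 versions the column may be named linewidth, not size)
b <- ggplot_build(b)
b$data[[1]]$size <- 0.2
b <- ggplot_gtable(b)
plot(b)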

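If you would rather stay with base graphics, the same two ideas can be tried directly on the hclust object. This is only a minimal sketch, not what I used above: the data are simulated stand-ins, log1p is one possible choice of log transform, and it assumes par(lwd = ...) is picked up by plot() for hclust objects.

```r
# Simulated stand-in for the real data (assumption: 300 points in 3 blobs)
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2),
           matrix(rnorm(200, mean = 8), ncol = 2))

hc <- hclust(dist(x), method = "ward.D")

# Log-transform the merge heights. log1p is monotone increasing, so the
# merge order, and therefore the cutree() groups, stay the same.
hc_log <- hc
hc_log$height <- log1p(hc$height)

op <- par(lwd = 0.3)   # request thinner lines for the tree
plot(hc_log, labels = FALSE, hang = -1, ylab = "log(1 + height)")
rect.hclust(hc_log, k = 3)
par(op)                # restore the previous graphics settings

# Check that the transform did not change the three-group assignment
same_groups <- identical(cutree(hc, k = 3), cutree(hc_log, k = 3))
```

The log transform stretches the crowded region near the bottom of the tree and compresses the few large merges at the top, which is the same effect scale_y_log10() has in the ggplot version.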

