Extract labels membership / classification from a cut dendrogram in R (i.e.: a cutree function for dendrogram)

I'm trying to extract a classification from a dendrogram in R that I've cut at a certain height. This is easy to do with cutree on an hclust object, but I can't figure out how to do it on a dendrogram object.

Further, I can't just use my clusters from the original hclust, because (frustratingly) the numbering of the classes from cutree is different from the numbering of the classes from cut.

hc <- hclust(dist(USArrests), "ave")

classification<-cutree(hc,h=70)

dend1 <- as.dendrogram(hc)
dend2 <- cut(dend1, h = 70)


str(dend2$lower[[1]]) #group 1 here is not the same as
classification[classification==1] #group 1 here

Is there a way to either get the classifications to map to each other, or alternatively to extract lower branch memberships from the dendrogram object (perhaps with some clever use of dendrapply?) in a format more like what cutree gives?

I suggest using the cutree function from the dendextend package. It includes a dendrogram method (i.e., dendextend:::cutree.dendrogram).

You can learn more about the package from its introductory vignette.

I should add that while your function (classify) is good, there are several advantages to using cutree from dendextend:

  1. It also allows you to use a specific k (number of clusters), and not just h (a specific height).

  2. It is consistent with the result you would get from cutree on hclust (classify will not be).

  3. It will often be faster.

Here are examples of using the code:

# Toy data:
hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)

# Get the package:
install.packages("dendextend")
library(dendextend)

# Cut the tree:
cutree(dend1,h=70) # it now works on a dendrogram
# It is like using:
dendextend:::cutree.dendrogram(dend1,h=70)

By the way, building on this function, dendextend lets the user do more cool things, like coloring branches/labels based on cutting the dendrogram:

dend1 <- color_branches(dend1, k = 4)
dend1 <- color_labels(dend1, k = 5)
plot(dend1)

Lastly, here is some more code demonstrating my other points:

# This would also work with k:
cutree(dend1,k=4)

# and would give identical result as cutree on hclust:
identical(cutree(hc,h=70)  , cutree(dend1,h=70)  )
   # TRUE

# But this is not the case for classify:
identical(classify(dend1,70)   , cutree(dend1,h=70)  )
   # FALSE


install.packages("microbenchmark")
require(microbenchmark)
microbenchmark(classify = classify(dend1,70),
               cutree = cutree(dend1,h=70)  )
#    Unit: milliseconds
#        expr      min       lq   median       uq       max neval
#    classify  9.70135  9.94604 10.25400 10.87552  80.82032   100
#      cutree 37.24264 37.97642 39.23095 43.21233 141.13880   100
# 4 times faster for this tree (it will be more for larger trees)

# To be exact: if I force cutree.dendrogram not to go through hclust (which can happen for "weird" trees), the speed remains similar:
microbenchmark(classify = classify(dend1,70),
               cutree = cutree(dend1,h=70, try_cutree_hclust = FALSE)  )
# Unit: milliseconds
#        expr       min        lq    median       uq      max neval
#    classify  9.683433  9.819776  9.972077 10.48497 29.73285   100
#      cutree 10.275839 10.419181 10.540126 10.66863 16.54034   100

If you are thinking of ways to improve this function, please patch it through here:

https://github.com/talgalili/dendextend/blob/master/R/cutree.dendrogram.R

I hope you, or others, will find this answer helpful.

I ended up creating a function to do it using dendrapply. It's not elegant, but it works:

classify <- function(dendrogram,height){

#mini-function to use with dendrapply to return tip labels
 members <- function(n) {
    labels<-c()
    if (is.leaf(n)) {
        a <- attributes(n)
        labels<-c(labels,a$label)
    }
    labels
 }

 dend2 <- cut(dendrogram,height) #the cut dendrogram object
 branchesvector<-c()
 membersvector<-c()

 for(i in seq_along(dend2$lower)){                            #for each lower tree resulting from the cut
  memlist <- unlist(dendrapply(dend2$lower[[i]],members))     #get the tip labels
  branchesvector <- c(branchesvector,rep(i,length(memlist)))  #add the lower tree identifier to a vector
  membersvector <- c(membersvector,memlist)                   #add the tip labels to a vector
 }
out<-as.integer(branchesvector)                               #make the output a named integer vector, to match cutree() output
names(out)<-membersvector
out
}

Using the function makes it clear what the problem is: cut numbers the lower branches left to right along the dendrogram, while cutree numbers the clusters in the order the observations appear in the data (alphabetical for USArrests).
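The numbering difference can be checked, and undone, with base R alone. The following is a minimal sketch, assuming only functions from the stats package (hclust, cutree, cut, labels); the names by_cut, by_cutree, map and relabelled are illustrative and not part of any package. It cross-tabulates the two labelings and then renames the cut branches to the cutree numbering.

```r
hc   <- hclust(dist(USArrests), "ave")
dend <- as.dendrogram(hc)

# cutree(): named integer vector, one entry per observation
by_cutree <- cutree(hc, h = 70)

# cut(): list of lower branches; give every leaf the index of its branch
lower  <- cut(dend, h = 70)$lower
by_cut <- unlist(lapply(seq_along(lower), function(i) {
  labs <- labels(lower[[i]])
  setNames(rep(i, length(labs)), labs)
}))

# Put both vectors in the same (data) order, then compare them
by_cut <- by_cut[names(by_cutree)]
tab <- table(cut = by_cut, cutree = by_cutree)
print(tab)

# Build a lookup from cut branch id to cutree cluster id,
# using the first observation found in each branch
first <- !duplicated(by_cut)
map   <- setNames(by_cutree[first], by_cut[first])
relabelled <- setNames(map[as.character(by_cut)], names(by_cut))
```

If the two functions describe the same partition, every row and column of tab has a single non-zero cell, and relabelled carries the cutree numbering for the branches produced by cut.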

hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)

classify(dend1,70) #Florida 1, North Carolina 1, etc.
cutree(hc,h=70)    #Alabama 1, Arizona 1, Arkansas 1, etc.

Once you have made your dendrogram, use the cutree method and then convert the result to a data frame. The following code makes a nice dendrogram using the dendextend library:

library(dendextend)

# set the number of clusters
clust_k <- 8

# make the hierarchical clustering
# (mat is assumed to be your own numeric matrix or data frame; it is not defined here)
par(mar = c(2.5, 0.5, 1.0, 7))
d <- dist(mat, method = "euclidean")
hc <- hclust(d)
dend <- d %>% hclust %>% as.dendrogram
labels_cex(dend) <- .65
dend %>% 
  color_branches(k=clust_k) %>%
  color_labels() %>%
  highlight_branches_lwd(3) %>% 
  plot(horiz=TRUE, main = "Branch (Distribution) Clusters by Heloc Attributes", axes = TRUE)

Based on the coloring scheme, it looks like the clusters are formed around a height threshold of 4. So to get the assignments into a data frame, we need to get the clusters and then unlist() them.

First you need to get the clusters themselves. The result is a single named vector: the values are the cluster numbers and the names are the actual labels.

# creates a single named vector of cluster numbers (pass either k or h, not both)
myclusters <- cutree(dend, k = clust_k)

# make the dataframe of two columns cluster number and label
clusterDF <-  data.frame(Cluster = as.numeric(unlist(myclusters)),
                         Branch = names(myclusters))

# sort by cluster ascending (arrange() comes from dplyr)
library(dplyr)
clusterDFSort <- clusterDF %>% arrange(Cluster)
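Because mat is not defined in the snippet above, here is a self-contained variant of the same idea. It is a minimal sketch that assumes the built-in USArrests data and k = 4 purely for illustration, and uses only base R (no dendextend or dplyr); cluster_df and cluster_sorted are illustrative names.

```r
# Hierarchical clustering on a built-in data set
hc <- hclust(dist(USArrests), "ave")

# Named integer vector: values are cluster numbers, names are row labels
clusters <- cutree(hc, k = 4)

# Two-column data frame: cluster number and label
cluster_df <- data.frame(Cluster = unname(clusters),
                         Label   = names(clusters),
                         stringsAsFactors = FALSE)

# Sort by cluster, ascending
cluster_sorted <- cluster_df[order(cluster_df$Cluster), ]
head(cluster_sorted)
```

The same data frame construction applies to the named vector returned by cutree on a dendrogram when dendextend is loaded.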
