R: getting subtrees from a dendrogram based on cutree labels
I have clustered a large dataset and found 6 clusters that I am interested in analyzing in more depth.
I found the clusters using hclust with the "ward.D" method, and I would like to know whether there is a way to get "sub-trees" from hclust/dendrogram objects.
For example:
library(gplots)
library(dendextend)
data <- iris[,1:4]
distance <- dist(data, method = "euclidean", diag = FALSE, upper = FALSE)
hc <- hclust(distance, method = 'ward.D')
dnd <- as.dendrogram(hc)
plot(dnd) # to decide the number of clusters
clusters <- cutree(dnd, k = 6)
I used cutree to get the labels for each of the rows in my dataset.
I know I can get the data for each corresponding cluster (cluster 1, for example) with:
c1_data = data[clusters == 1,]
Is there an easy way to get the subtree for each label returned by dendextend::cutree? For example, say I am interested in getting the
I know I can access the branches of the dendrogram by doing something like:
subtree <- dnd[[1]][[2]]
but how can I get exactly the subtree corresponding to cluster 1?
I have tried:
dnd[clusters == 1]
but this of course doesn't work. So how can I get the subtree based on the labels returned by cutree?
================= UPDATED answer =================
This can now be solved using the get_subdendrograms function from dendextend.
# needed packages:
# install.packages("gplots")
# install.packages("viridis")
# install.packages("devtools")
# devtools::install_github('talgalili/dendextend') # dendextend from github
library(dendextend) # also provides the %>% pipe used below
# define dendrogram object to play with:
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
dend_list <- get_subdendrograms(dend, 5)
# Plotting the result
par(mfrow = c(2,3))
plot(dend, main = "Original dendrogram")
sapply(dend_list, plot)
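To check how the returned list relates to the cutree labels, a minimal sketch along these lines may help. It assumes that get_subdendrograms returns one sub-dendrogram per cluster and that cutree on the same dendrogram yields a vector named by leaf label; the variable names (ids, sizes) are illustrative only and are not part of dendextend.

```r
library(dendextend)

# Same kind of dendrogram as above, without the colouring
dend <- as.dendrogram(hclust(dist(iris[, -5])))
dend <- set(dend, "labels_to_character")

k <- 5
dend_list <- get_subdendrograms(dend, k)
clusters <- cutree(dend, k = k)  # named by leaf label

# For every sub-dendrogram, look up the cluster id of each of its leaves
ids <- lapply(dend_list, function(d) unique(unname(clusters[labels(d)])))
sizes <- sapply(dend_list, nleaves)

print(ids)    # one cluster id per sub-dendrogram is expected
print(sizes)  # the sizes should add up to nleaves(dend)
```

If every element of ids has length one, each sub-dendrogram corresponds to exactly one cutree label, which is the mapping the question asks for.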
This can also be used within a heatmap:
# plot a heatmap of only one of the sub dendrograms
par(mfrow = c(1,1))
library(gplots)
sub_dend <- dend_list[[1]] # get the sub dendrogram
# make sure of the size of the dend
nleaves(sub_dend)
length(order.dendrogram(sub_dend))
# get the subset of the data
subset_iris <- as.matrix(iris[order.dendrogram(sub_dend),-5])
# update the dendrogram's internal order so to not cause an error in heatmap.2
order.dendrogram(sub_dend) <- rank(order.dendrogram(sub_dend))
heatmap.2(subset_iris, Rowv = sub_dend, trace = "none", col = viridis::viridis(100))
================= OLDER answer =================
I think these two functions can be helpful for you.
The first one just iterates through all clusters and extracts the substructure for each. It requires the dendrogram object from which we want to get the subdendrograms, and the cluster labels (e.g. as returned by cutree). It returns a list of subdendrograms.
extractDendrograms <- function(dendr, clusters){
lapply(unique(clusters), function(clust.id){
getSubDendrogram(dendr, which(clusters==clust.id))
})
}
The second one performs a depth-first search to determine which subtree contains the cluster and, once a subtree matches the full cluster, returns it. Here, we assume that all elements of a cluster are in one subtree. It requires the dendrogram to search (dendr) and the indices of the cluster's elements (my.clust). It returns the subdendrogram corresponding to the cluster of the given elements.
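getSubDendrogram <- function(dendr, my.clust){
  # every leaf of this subtree belongs to the cluster: this is the subtree we want
  if(all(unlist(dendr) %in% my.clust))
    return(dendr)
  # otherwise descend into whichever branch contains cluster members
  if(any(unlist(dendr[[1]]) %in% my.clust))
    return(getSubDendrogram(dendr[[1]], my.clust))
  else
    return(getSubDendrogram(dendr[[2]], my.clust))
}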
Using these two functions, we can take the variables you provided in the question and get the following output. (I think the line clusters <- cutree(dnd, k = 6) should be clusters <- cutree(hc, k = 6).)
my.sub.dendrograms <- extractDendrograms(dnd, clusters)
Plotting all six elements of the list gives all the subdendrograms.
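The whole flow, from clustering to plotting the six subtrees, can be sketched with base R alone. This is only an illustration under the same assumption as above (each cluster occupies exactly one subtree); the helper is a renamed copy of getSubDendrogram so that the block runs on its own, and the names get_sub_dendrogram, subs and sizes are made up for this example.

```r
# Renamed copy of the recursive helper, so this block is self-contained
get_sub_dendrogram <- function(dendr, members) {
  if (all(unlist(dendr) %in% members)) return(dendr)
  if (any(unlist(dendr[[1]]) %in% members)) {
    return(get_sub_dendrogram(dendr[[1]], members))
  }
  get_sub_dendrogram(dendr[[2]], members)
}

hc <- hclust(dist(iris[, 1:4]), method = "ward.D")
dnd <- as.dendrogram(hc)
clusters <- cutree(hc, k = 6)  # cutree on the hclust object: labels in row order

# One subtree per cluster id; which() gives the row indices stored in the leaves
subs <- lapply(sort(unique(clusters)), function(id) {
  get_sub_dendrogram(dnd, which(clusters == id))
})

# Sanity check: each subtree should hold as many leaves as its cluster has rows
sizes <- vapply(subs, function(d) length(unlist(d)), integer(1))
print(rbind(subtree = sizes, cluster = as.integer(table(clusters))))

# Plot the six subtrees side by side
op <- par(mfrow = c(2, 3))
for (i in seq_along(subs)) plot(subs[[i]], main = paste("Cluster", i))
par(op)
```

The search relies on unlist() of a dendrogram returning the original row indices of its leaves, which is why the labels are taken from cutree(hc, k = 6) in row order.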
EDIT
As suggested in the comments, I have added a function that takes a dendrogram dend and the number of subtrees k as input, but still uses the previously defined recursive function getSubDendrogram:
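prune_cutree_to_dendlist <- function(dend, k, order_clusters_as_data=TRUE) {
  # pass order_clusters_as_data by name: the third positional argument of cutree is h
  # getSubDendrogram matches against leaf values (row indices), so data order is the default here
  clusters <- cutree(dend, k, order_clusters_as_data = order_clusters_as_data)
  lapply(unique(clusters), function(clust.id){
    getSubDendrogram(dend, which(clusters==clust.id))
  })
}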
A test case for 5 substructures:
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
subdend.list <- prune_cutree_to_dendlist(dend, 5)
#plotting
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(subdend.list, plot)
I ran a benchmark using rbenchmark, comparing against the function suggested by Tal Galili (here named prune_cutree_to_dendlist2), and the results are quite promising for the DFS approach above:
library(rbenchmark)
benchmark(prune_cutree_to_dendlist(dend, 5),
prune_cutree_to_dendlist2(dend, 5), replications=5)
                                test replications elapsed relative user.self
1  prune_cutree_to_dendlist(dend, 5)            5    0.02        1     0.020
2 prune_cutree_to_dendlist2(dend, 5)            5   60.82     3041    60.643
I have now written the function prune_cutree_to_dendlist to do what you asked for. I should add it to dendextend at some point in the future.
In the meantime, here is an example of the code and output. (The function is a bit slow. Making it faster relies on making prune faster, which I won't get to in the near future.)
# install.packages("dendextend")
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>%
set("labels_to_character")
dend <- dend %>% color_branches(k=5)
# plot(dend)
prune_cutree_to_dendlist <- function(dend, k) {
  clusters <- cutree(dend, k, order_clusters_as_data = FALSE)
  # unique_clusters <- unique(clusters) # could also be 1:k but it would be less robust
  # k <- length(unique_clusters)
  # for(i in unique_clusters) {
  dends <- vector("list", k)
  for(i in 1:k) {
    leaves_to_prune <- labels(dend)[clusters != i]
    dends[[i]] <- prune(dend, leaves_to_prune)
  }
  class(dends) <- "dendlist"
  dends
}
prunned_dends <- prune_cutree_to_dendlist(dend, 5)
sapply(prunned_dends, nleaves)
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(prunned_dends, plot)
How did you get 6 clusters using hclust? You can cut the tree at any point, so you can just ask cutree to give you more clusters:
clusters = cutree(hclusters, number_of_clusters)
If you have a lot of data, this may not be very handy though. In these cases, what I do is manually pick the clusters that I want to study further and then run hclust only on the data in those clusters. I don't know of any functionality in hclust that lets you do this automatically, but it's quite easy:
good_clusters = c(which(clusters==1),
which(clusters==2)) # or whichever clusters you want
new_df = df[good_clusters,]
new_hclusters = hclust(dist(new_df)) # hclust expects a dissimilarity object, not a data frame
new_clusters = cutree(new_hclusters, new_number_of_clusters)
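As a runnable illustration of this re-clustering idea, here is a small base R sketch on iris. The choice of clusters 1 and 2, the "ward.D" method and k = 3 for the second pass are arbitrary assumptions made for the example, as are the variable names.

```r
data <- iris[, 1:4]

# First pass: cluster everything and cut into 6 groups
hc <- hclust(dist(data), method = "ward.D")
clusters <- cutree(hc, k = 6)

# Keep only the rows that fall in the clusters of interest
keep <- which(clusters %in% c(1, 2))
sub_data <- data[keep, ]

# Second pass: cluster the subset on its own distance matrix
sub_hc <- hclust(dist(sub_data), method = "ward.D")
sub_clusters <- cutree(sub_hc, k = 3)

print(table(sub_clusters))
plot(as.dendrogram(sub_hc), main = "Re-clustered subset")
```

Note that this builds a new tree from the subset, so it need not match the corresponding branch of the original dendrogram.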