简体   繁体   中英

How to get gap statistic for hierarchical average clustering

I perform a hierarchical cluster analysis based on 'average linkage' In base r, I use

dist_mat <- dist(cdata, method = "euclidean")
hclust_avg <- hclust(dist_mat, method = "average")

I want to calculate the gap statistics to decide optimal number of clusters. I use the 'cluster' library and the clusGap function. Since I can't pass the hclust solution nor specify average hiearchical clustering in the clusGap function, I use these lines:

cluster_fun <- function(x, k) list(cluster = cutree(hclust(dist(x, method = "euclidean"), method="average"), k = k))
gap_stat <- clusGap(cdata, FUN=cluster_fun, K.max=10, B=50)
print(gap_stat)

However, here I can't check the cluster solution. So, my question is - can I be sure that the gap statistic is calculated on the same solution as hclust_avg?

Is there a better way of doing this?

Yes it should be the same. In the clusGap function, it calls the cluster_fun for each k you provided, then calculates the pooled within cluster sum of squares around, as described in the paper This is the bit of code called inside clusGap that calls your custom function:

W.k <- function(X, kk) {
        clus <- if (kk > 1) 
            FUNcluster(X, kk, ...)$cluster
        else rep.int(1L, nrow(X))
        0.5 * sum(vapply(split(ii, clus), function(I) {
            xs <- X[I, , drop = FALSE]
            sum(dist(xs)^d.power/nrow(xs))
        }, 0))
    }

And from here, the gap statistics is calculated.

You can calculate the gap statistic using some custom code, but for the sake of reproducibility, etc, it might be easier to use this?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM