
How do I choose a linkage method for Hierarchical Agglomerative Clustering?

I understand that HAC has several options in terms of linkage functions. You have:

  • Single linkage which produces "straggly" clusters
  • Complete linkage which produces tight, spherical clusters
  • Average linkage which is sort of a compromise between the two
  • Ward's method, which is based more on variance than on actual distance

What I'm trying to figure out is, how do I know which one of these I want to use? Are there certain datasets where "straggly" clusters are preferable to spherical ones? Or is it more a function of what I intend to do with the clustering data?

It depends on your data.

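Single linkage works reasonably well on clean data, where the clusters are separated by clear gaps. It merges clusters based on their two closest points, so a few stray points lying between two clusters can chain them together.

If you have dirty data, the other linkages may be better, since they take more than the single closest pair into account.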
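To make the clean vs. dirty distinction concrete, here is a minimal sketch using SciPy's `linkage` and `fcluster`. The toy data (two grid-shaped blobs, a thin "bridge" of noise points, and one stray outlier) and the helper names `grid_blob` and `two_cluster_sizes` are made up for illustration; they are not from any particular dataset or library.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

METHODS = ["single", "complete", "average", "ward"]

def grid_blob(x0, y0, side=5):
    """Hypothetical helper: a side x side grid of unit-spaced points."""
    xs, ys = np.meshgrid(np.arange(side), np.arange(side))
    return np.column_stack([x0 + xs.ravel(), y0 + ys.ravel()]).astype(float)

def two_cluster_sizes(points, method):
    """Cut the dendrogram into at most two flat clusters; return sorted sizes."""
    Z = linkage(points, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")
    return sorted(np.bincount(labels)[1:].tolist())

# Clean data: two compact blobs separated by a wide gap.
clean = np.vstack([grid_blob(0, 0), grid_blob(40, 0)])

# Dirty data: the same blobs, plus a bridge of noise points between them
# (spacing 1.5) and one outlier that sits 6 units above the bridge.
bridge_x = np.arange(5.5, 39.0, 1.5)
bridge = np.column_stack([bridge_x, np.full_like(bridge_x, 2.0)])
outlier = np.array([[22.0, 8.0]])
dirty = np.vstack([clean, bridge, outlier])

for name, data in [("clean", clean), ("dirty", dirty)]:
    for method in METHODS:
        print(name, method, two_cluster_sizes(data, method))
```

On the clean set, all four methods should return the two 25-point blobs. On the dirty set, single linkage splits off only the lone outlier, because the bridge chains everything else into one cluster. How the other linkages divide the dirty set depends on the data, so it is worth printing and inspecting the sizes rather than assuming.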

Ward is similar to k-means: both try to keep the within-cluster variance small. It may be a good choice if you want to talk about centroids and about data partitioned completely into disjoint subsets.
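As a rough illustration of the "centroids and disjoint subsets" view, the sketch below cuts a Ward dendrogram into two flat clusters, then computes each cluster's centroid and the within-cluster sum of squares (the quantity k-means also tries to minimize). The toy data and the helper `grid_blob` are assumptions made for this example only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def grid_blob(x0, y0, side=5):
    """Hypothetical helper: a side x side grid of unit-spaced points."""
    xs, ys = np.meshgrid(np.arange(side), np.arange(side))
    return np.column_stack([x0 + xs.ravel(), y0 + ys.ravel()]).astype(float)

# Two compact, well-separated blobs (illustrative data only).
points = np.vstack([grid_blob(0, 0), grid_blob(40, 0)])

# Ward linkage, then a flat partition: every point gets exactly one label.
Z = linkage(points, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
cluster_ids = np.unique(labels)

# One centroid per cluster, as you would report for k-means.
centroids = np.array([points[labels == k].mean(axis=0) for k in cluster_ids])

# Within-cluster sum of squared distances to the centroid.
wcss = sum(
    ((points[labels == k] - c) ** 2).sum()
    for k, c in zip(cluster_ids, centroids)
)

print("centroids:\n", centroids)
print("within-cluster sum of squares:", wcss)
```

Unlike k-means, you do not have to fix the number of clusters up front: the same `Z` can be cut again with a different `t`.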

The other problem is that only SLINK (for single linkage) is fast, at O(n^2). All the others usually run in O(n^3), so they are not usable on large data sets. Compare this to, e.g., DBSCAN, which runs in O(n log n) if done well (with a suitable index), or k-means, which is O(n) per iteration.
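To get a feel for what these growth rates mean in practice, here is a back-of-the-envelope sketch. It only does arithmetic: the numbers are abstract operation counts, not measured runtimes, and the function names are made up for this example. The memory estimate assumes the linkage is computed from a full pairwise distance matrix stored in condensed form (n(n-1)/2 values of 8 bytes each), which is common but not universal.

```python
import math

def rough_costs(n):
    """Abstract operation counts for each growth rate (constants ignored)."""
    return {
        "O(n)": n,
        "O(n log n)": n * math.log2(n),
        "O(n^2)": n ** 2,
        "O(n^3)": n ** 3,
    }

def condensed_distance_bytes(n, bytes_per_value=8):
    """Bytes needed for n*(n-1)/2 pairwise distances (assumes 8-byte floats)."""
    return n * (n - 1) // 2 * bytes_per_value

for n in (1_000, 100_000):
    costs = rough_costs(n)
    summary = ", ".join(f"{name} ~ {value:.1e}" for name, value in costs.items())
    gigabytes = condensed_distance_bytes(n) / 1e9
    print(f"n={n}: {summary}; distance matrix ~ {gigabytes:.2f} GB")
```

At n = 100,000 the distance matrix alone comes to about 40 GB under these assumptions, before any O(n^3) merging work starts.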


 