
How to draw hierarchical clustering?

I have the following dataset:

data<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
for(i in 1:nrow(data)){ data[i,i]<-NA}
colnames(data) <- c("A","B","C","D")
rownames(data) <- c("A","B","C","D")
plot(hclust(dist(data)))

which produces the following plot:

[image: dendrogram produced by the code above]

But I am wondering how this plot is drawn, so I am trying to obtain the dendrogram step by step. We know that the distance matrix at the beginning is as follows:

[image: initial distance matrix]

At each step we find the two points with the minimum distance and merge them into a single cluster.

[image]

So the first merge is B and C, and we update the distance matrix:

[image: distance matrix after merging B and C]

Again we find the two points with the minimum distance, which are D and the cluster of B and C.

[image]

Again we update the distance matrix:

[image: distance matrix after merging D with B,C]

As a result, I should get the following merges:

  1. B, and C
  2. B,C, and D
  3. B,C,D, and A

But this contradicts the plot that R produced. How can the difference be explained?

Updated Response - Using single linkage rather than the default complete linkage.

I'll do my best to explain how I see this working. I believe it comes down to the method argument of hclust. The default method does not follow the algorithm you laid out, but we can change the method so that it does.

But first, I am getting an error on the plot you are trying to make:

> data<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
> for(i in 1:nrow(data)){ data[i,i]<-NA}
> colnames(data) <- c("A","B","C","D")
> rownames(data) <- c("A","B","C","D")
> plot(hclust(dist(data)))
Error in hclust(dist(data)) : 
  NA/NaN/Inf in foreign function call (arg 11)

What is your intention with the for(i in 1:nrow(data)){ data[i,i]<-NA} line? After that line, your data object looks like this:

   X  Y V3 V4
1 NA  1 NA NA
2  2 NA NA NA
3  3  2 NA NA
4  4  1 NA NA
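A minimal sketch of why the loop produces that object and why hclust then fails, assuming nothing beyond base R. The comments describe what I expect each call to show; the variable d is introduced here only for illustration.

```r
# Rebuild the data frame from the question.
data <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
dim(data)                      # 4 rows, 2 columns

# data[i, i] addresses row i, column i. Columns 3 and 4 do not exist yet,
# so the assignment creates them (V3, V4), filled with NA.
for (i in seq_len(nrow(data))) data[i, i] <- NA
dim(data)                      # now 4 rows, 4 columns
colSums(is.na(data))           # number of NA values per column

# Rows 1 and 2 share no non-missing column, so their distance is NA,
# and hclust() rejects a distance object that contains NA.
d <- dist(data)
anyNA(d)
```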

However, if we start from the following code instead, we can generate the desired tree:

dt<-data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
rownames(dt) <- c("A", "B", "C", "D")
dt<-dist(dt)
plot(hclust(dt, method = "single"))

[image: dendrogram produced with single linkage]
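The merge order can also be read from the fitted object rather than from the plot. The sketch below uses the merge, height, and labels components that hclust returns, plus cutree; the names pts and hc_single are mine. Note that B-C and C-D are tied at sqrt(2), so which of the two pairs appears first in the merge table may depend on how ties are broken.

```r
pts <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1),
                  row.names = c("A", "B", "C", "D"))
hc_single <- hclust(dist(pts), method = "single")

hc_single$labels   # observation names, in input order
hc_single$merge    # one row per merge: negative = single point, positive = earlier merge
hc_single$height   # distance at which each merge happens

# Cutting the tree into two groups separates A from the cluster of B, C, D.
cutree(hc_single, k = 2)
```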

Note the change on the hclust call to method = "single". The default is method = "complete". Both methods merge the two closest clusters at each step; they differ in how the distance between two clusters is measured. Single linkage uses the smallest pairwise distance between members of the two clusters, which matches the merges you worked out by hand. Complete linkage uses the largest pairwise distance instead. The excerpt below, from the fantastic Introduction to Statistical Learning with Applications in R, describes the various linkage methods available:

[image: excerpt describing the linkage methods]

This text, by James, Witten, Hastie, and Tibshirani, is available as a free download at the link above. The section on hierarchical clustering starts on page 390. Please let me know if this helps clear things up.
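The difference between the two linkage rules can be traced by hand. The sketch below is a naive illustration and not how hclust is implemented; agglomerate is a hypothetical helper name, and ties are broken by taking the first pair found, which may differ from what hclust does.

```r
pts <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1),
                  row.names = c("A", "B", "C", "D"))
m <- as.matrix(dist(pts))

# Naive agglomerative clustering on a full distance matrix (illustration only).
# `link` reduces the pairwise distances between two clusters to one number:
# min gives single linkage, max gives complete linkage.
agglomerate <- function(m, link) {
  clusters <- as.list(rownames(m))   # every point starts as its own cluster
  steps <- character(0)
  while (length(clusters) > 1) {
    best <- c(1, 2)
    best_d <- Inf
    for (i in seq_len(length(clusters) - 1)) {
      for (j in (i + 1):length(clusters)) {
        d <- link(m[clusters[[i]], clusters[[j]]])
        if (d < best_d) {            # strict "<": the first pair wins a tie
          best_d <- d
          best <- c(i, j)
        }
      }
    }
    a <- clusters[[best[1]]]
    b <- clusters[[best[2]]]
    steps <- c(steps, sprintf("%s + %s at %.3f",
                              paste(a, collapse = ","),
                              paste(b, collapse = ","), best_d))
    clusters <- c(clusters[-best], list(c(a, b)))
  }
  steps
}

agglomerate(m, min)   # single:   B + C, then D joins B,C, then A
agglomerate(m, max)   # complete: B + C, then A joins B,C, then D
```

With this tie-break, both rules start with B and C. They diverge at the second step: the distance from {B,C} to D is 1.414 under min but 2.828 under max, so complete linkage picks A (2.236) instead.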

Original Response

I think you are calling the dist function in the wrong manner and perhaps too many times. Try this:

dt<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
rownames(dt) <- c("A","B","C","D")
dt<-dist(dt)
plot(hclust(dt))

[image: dendrogram produced with the default method]

Effectively, you were calling dist to get an object of class dist, turning that object into a matrix, and then calling dist on it again inside your call to plot.

We can examine just the distance object as follows:

> dt
         A        B        C
B 2.236068                  
C 2.236068 1.414214         
D 3.000000 2.828427 1.414214

There is no need to call dist on this object again before passing it to the hclust function.
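A short sketch of the point, using only base R; the names pts, m, again, and back are mine. Calling dist on the matrix form treats each row of the distance matrix as a new point, so the result is no longer the original set of distances. If a matrix of distances is all you have, as.dist converts it without recomputing anything.

```r
pts <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1),
                  row.names = c("A", "B", "C", "D"))
dt <- dist(pts)
class(dt)                  # "dist"

m <- as.matrix(dt)         # full symmetric 4 x 4 matrix, zeros on the diagonal
again <- dist(m)           # distances between ROWS of m, not the original distances
back <- as.dist(m)         # same values as dt, just converted back

isTRUE(all.equal(as.vector(back), as.vector(dt)))    # TRUE
isTRUE(all.equal(as.vector(again), as.vector(dt)))   # FALSE

hc <- hclust(dt)           # the dist object goes to hclust directly
```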
