Clustering in R using Levenshtein distance

I am trying to run k-means clustering using the Levenshtein distance, and I am having a hard time interpreting the results.

    # Courtesy: code is borrowed from the other thread listed below, with k-means clustering added
    set.seed(1)
    rstr <- function(n, k) {   # vector of n random strings of k characters each
      sapply(1:n, function(i) {
        do.call(paste0, as.list(sample(letters, k, replace = TRUE)))
      })
    }

    # 30 strings: 10 starting with "aa", 10 with "bb", 10 with "cc"
    str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

    # Levenshtein distance
    d <- adist(str)
    rownames(d) <- str
    hc <- hclust(as.dist(d))
    plot(hc)

    # Normalize the distances (useful when sequences have unequal lengths)
    max <- max(d)
    data <- d / max

    k.means.fit <- kmeans(data, 3)
    library(cluster)
    clusplot(d, k.means.fit$cluster, main = 'Clustering',
             color = TRUE, shade = TRUE,
             labels = 5, lines = 0, col.p = "dark green")

So, what does the cluster plot show, and how can I interpret it? I looked at other threads, which explain that the points are plotted on the first two principal components: https://stats.stackexchange.com/questions/274754/how-to-interpret-the-clusplot-in-r

But it is still not clear to me how to explain the figure, or why those points fall in a given ellipse/cluster. Any ideas? Thanks!

This is pretty straightforward. You constructed your strings to fall into three groups: ten strings that start with 'aa', ten with 'bb', and ten with 'cc'. After those two-letter prefixes, the rest of each string is random. Under Levenshtein distance, you would expect strings that share the same first two letters to be close to each other. When you look at the plot of the hierarchical clustering, it is easy to see three main groups defined by the first two letters of the strings. When you use kmeans with k=3, you get the same clusters. You can see this by checking the cluster assignments:

 k.means.fit$cluster
aagjo aaxfx aayrq aabfe aarju aamsz aajuy aafqd aagka aajwi bbmpm bbevr bbucs 
    1     1     1     1     1     1     1     1     1     1     3     3     3 
bbkvq bbuon bbuam bbtsm bbwlg bbbci bbnrk ccxhl cciqg ccmtc ccwiv ccjim ccxwk 
    3     3     3     3     3     3     3     2     2     2     2     2     2 
ccuyl ccski cctfs ccdgd 
    2     2     2     2 

Cluster 1 contains the strings that start with 'aa', cluster 2 those that start with 'cc', and cluster 3 those that start with 'bb'.
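To make the correspondence between prefixes and clusters easier to check, and to see roughly what the cluster plot is drawing, here is a minimal sketch. It rebuilds the data under the same construction as in the question; the names `strs`, `km`, `tab`, `pc` and `pc2` are introduced here for illustration only, and `nstart = 25` is an added assumption to make k-means less sensitive to its random starting centers. The `prcomp` projection only approximates the idea behind `clusplot` (points drawn on the first two principal components); it is not a reproduction of its internals. The exact strings and cluster numbers may differ from the output above, depending on your R version's random number generator.

```r
set.seed(1)

# Same construction as in the question: random 3-letter suffixes
rstr <- function(n, k) {
  sapply(1:n, function(i) paste0(sample(letters, k, replace = TRUE), collapse = ""))
}
strs <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

# Levenshtein distance matrix, scaled to [0, 1]
d <- adist(strs)
rownames(d) <- strs
km <- kmeans(d / max(d), centers = 3, nstart = 25)  # nstart is an assumption, not in the question

# Cross-tabulate the two-letter prefix against the assigned cluster.
# If k-means recovered the groups, each row has a single non-zero cell.
tab <- table(prefix = substr(strs, 1, 2), cluster = km$cluster)
print(tab)

# Project each row of the distance matrix onto its first two principal components
# and color the points by cluster; strings with the same prefix should sit together.
pc <- prcomp(d)
pc2 <- pc$x[, 1:2]
plot(pc2, col = km$cluster, pch = 19, main = "First two principal components")
text(pc2, labels = strs, pos = 3, cex = 0.7)
```

Each point in the plot is one string, and its position summarizes how far it is from all 30 strings. Strings sharing a prefix have similar distance profiles, so they land near each other, which is why they end up inside the same ellipse in the cluster plot.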
