
Clustering in R with Levenshtein distance

I am trying to run k-means clustering using the Levenshtein distance, and I am having a hard time interpreting the results.

    # Courtesy: code borrowed from the other thread listed below, with k-means clustering added
    library(cluster)   # provides clusplot()

    set.seed(1)

    # Vector of n random strings, each k lowercase letters long
    rstr <- function(n, k) {
      sapply(1:n, function(i) {
        do.call(paste0, as.list(sample(letters, k, replace = TRUE)))
      })
    }

    str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

    # Levenshtein distance
    d <- adist(str)
    rownames(d) <- str
    hc <- hclust(as.dist(d))
    plot(hc)

    # Normalize the distances (for when the sequences have unequal lengths)
    max_d <- max(d)
    data <- d / max_d

    k.means.fit <- kmeans(data, 3)
    clusplot(d, k.means.fit$cluster, main = 'Clustering',
             color = TRUE, shade = TRUE,
             labels = 5, lines = 0, col.p = "dark green")

So what does the cluster plot show, and how can I interpret it? I referred to other threads, where it is explained that the clusters are plotted on two principal components: https://stats.stackexchange.com/questions/274754/how-to-interpret-the-clusplot-in-r

But it was not clear to me how to explain the figure, or why those points fall in that ellipse/cluster. Any ideas? Thanks!

This is pretty straightforward. You constructed your strings to be in three groups: ten strings that start with 'aa', ten with 'bb' and ten with 'cc'. After those beginnings, the rest of each string is random. Using Levenshtein distance, you would expect strings that start with the same first two letters to be close to each other. When you look at the plot of the hierarchical clustering, it is easy to see three main groups defined by the first two letters of the strings. When you use kmeans with k=3, you get the same clusters. You can see this by checking the cluster assignments:

    k.means.fit$cluster
    aagjo aaxfx aayrq aabfe aarju aamsz aajuy aafqd aagka aajwi bbmpm bbevr bbucs
        1     1     1     1     1     1     1     1     1     1     3     3     3
    bbkvq bbuon bbuam bbtsm bbwlg bbbci bbnrk ccxhl cciqg ccmtc ccwiv ccjim ccxwk
        3     3     3     3     3     3     3     2     2     2     2     2     2
    ccuyl ccski cctfs ccdgd
        2     2     2     2
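
The expectation that strings sharing a prefix sit close together can be checked directly on the distance matrix. This is a minimal sketch, assuming the 30 strings are the ones printed in the output above (they are typed in here rather than regenerated, since `sample()` can give different strings across R versions). The helper names `same_prefix`, `off_diag`, `within_mean`, `between_mean` and `within_max` are my own, not part of the original code.

```r
# The 30 strings from the cluster output above (copied, not regenerated).
strs <- c("aagjo", "aaxfx", "aayrq", "aabfe", "aarju", "aamsz", "aajuy", "aafqd", "aagka", "aajwi",
          "bbmpm", "bbevr", "bbucs", "bbkvq", "bbuon", "bbuam", "bbtsm", "bbwlg", "bbbci", "bbnrk",
          "ccxhl", "cciqg", "ccmtc", "ccwiv", "ccjim", "ccxwk", "ccuyl", "ccski", "cctfs", "ccdgd")

d <- adist(strs)                 # Levenshtein distance matrix (utils::adist)
prefix <- substr(strs, 1, 2)     # "aa", "bb" or "cc"

# Logical masks over the 30 x 30 matrix (hypothetical helper names).
same_prefix <- outer(prefix, prefix, "==")
off_diag    <- row(d) != col(d)  # drop each string's zero distance to itself

within_mean  <- mean(d[same_prefix & off_diag])  # pairs sharing a prefix
between_mean <- mean(d[!same_prefix])            # pairs with different prefixes
within_max   <- max(d[same_prefix])

print(c(within_mean = within_mean,
        between_mean = between_mean,
        within_max = within_max))
```

Two strings with the same prefix differ only in their three random letters, so their distance can be at most 3. Strings from different groups also have to change the two leading letters, which is why the between-group average comes out larger.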

In this output, cluster 1 contains the strings that start with 'aa', cluster 2 those that start with 'cc', and cluster 3 those that start with 'bb'. The cluster numbers themselves are arbitrary labels.
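
As for the clusplot itself: per the `cluster` package documentation, `clusplot` places the observations on two derived axes (principal components or multidimensional scaling) and draws an ellipse around each cluster. The sketch below is a rough stand-in for that view, not a reproduction of `clusplot`'s internals. It assumes `prcomp` on the normalized distance matrix is a fair approximation of the projection, reuses the strings from the output above, and adds `nstart = 25` to `kmeans` (not in the original call) so the result depends less on the random start. The names `km`, `tab`, `pca` and `scores` are illustrative.

```r
# The 30 strings from the cluster output above (copied, not regenerated).
strs <- c("aagjo", "aaxfx", "aayrq", "aabfe", "aarju", "aamsz", "aajuy", "aafqd", "aagka", "aajwi",
          "bbmpm", "bbevr", "bbucs", "bbkvq", "bbuon", "bbuam", "bbtsm", "bbwlg", "bbbci", "bbnrk",
          "ccxhl", "cciqg", "ccmtc", "ccwiv", "ccjim", "ccxwk", "ccuyl", "ccski", "cctfs", "ccdgd")

d <- adist(strs)
rownames(d) <- strs
d_norm <- d / max(d)             # same normalization as in the question

set.seed(1)
# nstart = 25 is an assumption added here; the question calls kmeans(data, 3).
km <- kmeans(d_norm, centers = 3, nstart = 25)

# Cross-tabulate the known prefix against the assigned cluster.
tab <- table(prefix = substr(strs, 1, 2), cluster = km$cluster)
print(tab)

# Project the rows of the distance matrix onto the first two principal components.
pca <- prcomp(d_norm)
scores <- pca$x[, 1:2]
print(head(scores))

# Draw the projection only in an interactive session.
if (interactive()) {
  plot(scores, col = km$cluster, pch = 19,
       xlab = "PC1", ylab = "PC2", main = "k-means clusters on the first two PCs")
  text(scores, labels = strs, pos = 3, cex = 0.7)
}
```

If each row of the table has a single non-zero entry, every prefix group landed in exactly one cluster, and the points that share a color in the scatter plot are the ones an ellipse would enclose. The cluster numbers may come out in a different order than in the output above.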


 