
R: clustering with a similarity or dissimilarity matrix? And visualizing the results

I have a similarity matrix that I created with Harry, a tool for string similarity, and I want to plot some dendrograms from it to see whether I can find clusters or groups in the data. I'm using the following similarity measures:

  • Normalized compression distance (NCD)
  • Damerau-Levenshtein distance
  • Jaro-Winkler distance
  • Levenshtein distance
  • Optimal string alignment distance (OSA)

("For comparison Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output")

At first, since this was my first time using R, I didn't pay much attention to the documentation of hclust, so I used it with a similarity matrix. I know I should have used a dissimilarity matrix, and I know that, since my similarity matrix is normalized to [0,1], I could just compute dissimilarity = 1 - similarity and then use hclust.

However, the groups I get from hclust with the similarity matrix are much better than the ones I get from hclust with its corresponding dissimilarity matrix.

I also tried the proxy package and the same problem occurs: the groups I get aren't what I expected.

To get the dendrogram from the similarity matrix I do:

  1. plot(hclust(as.dist(similarityMATRIX), "average"))

With the dissimilarity matrix I tried:

  2. plot(hclust(as.dist(dissimilarityMATRIX), "average"))

and

  3. plot(hclust(as.sim(dissimilarityMATRIX), "average"))

From (1) I get what I believe to be a very good dendrogram, so I can get very good groups out of it. From (2) and (3) I get the same dendrogram, and the groups I can get out of it aren't as good as the ones I get from (1).
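For reference, the difference between clustering on a similarity matrix and on its 1 - similarity counterpart can be reproduced on a toy example. This is only a sketch: the 4x4 matrix `sim` and its values are made up for illustration and are not Harry output, and `k = 2` is an arbitrary choice. It also checks the diagonal, since a diagonal of 1 indicates similarities while a diagonal of 0 would mean the matrix already holds distances.

```r
# Toy similarity matrix in [0, 1] (hypothetical values, not Harry output):
# a and b are very similar, c and d are very similar.
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(letters[1:4], letters[1:4]))

# Sanity check: a similarity matrix has 1 on the diagonal,
# a distance matrix has 0.
is_similarity <- all(diag(sim) == 1)

# hclust() expects dissimilarities, so convert before clustering.
dis <- 1 - sim
hc_dis <- hclust(as.dist(dis), method = "average")

# Passing the similarity matrix directly makes hclust() read high
# similarity as large distance, so the least similar pair merges first.
hc_sim <- hclust(as.dist(sim), method = "average")

# Cut both trees into two groups (k = 2 is an assumption for this toy data).
groups_dis <- cutree(hc_dis, k = 2)
groups_sim <- cutree(hc_sim, k = 2)
```

On this toy data, `groups_dis` puts a with b and c with d, while `groups_sim` pairs a with d, the least similar items. Calling `plot(hc_dis)` or `plot(hc_sim)` draws the corresponding dendrogram.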

I can say whether the groups are good or bad because at the moment I have a fairly small volume of data to analyse, so I can check them very easily.

Does what I'm getting make any sense? Is there something that justifies it? Any suggestions on how to cluster with a similarity matrix? Is there a better way to visualize a similarity matrix than a dendrogram?

You can visualize a similarity matrix using a heatmap (for example, with the heatmaply R package). You can check how well a dendrogram fits by using the cor_cophenetic function from the dendextend R package (use the most recent version from GitHub).
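As a rough sketch of both ideas using only base R, in place of heatmaply and dendextend (a substitution made here so the example has no extra dependencies): `heatmap()` draws the matrix ordered by the clustering tree, and the correlation between the input distances and `cophenetic()` of the tree gives a cophenetic correlation for judging fit. The toy matrix `sim` is hypothetical.

```r
# Toy similarity matrix in [0, 1] (hypothetical values for illustration).
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(letters[1:4], letters[1:4]))

# Convert to dissimilarities and build the tree.
dis <- as.dist(1 - sim)
hc <- hclust(dis, method = "average")

# Cophenetic correlation: how well the merge heights of the tree
# reproduce the original pairwise dissimilarities (closer to 1 is better).
coph_cor <- cor(dis, cophenetic(hc))

# Heatmap of the similarity matrix, with rows and columns ordered by the tree.
# scale = "none" keeps the raw similarity values.
heatmap(sim, Rowv = as.dendrogram(hc), symm = TRUE, scale = "none")
```

Comparing `coph_cor` across linkage methods (for example "average" versus "complete") is one way to pick the tree that distorts the input distances least.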

Distance-based clustering can be done with hclust, but also with cluster::pam (k-medoids).
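A minimal sketch of the k-medoids route, assuming the cluster package is installed (it ships with standard R distributions) and using a hypothetical toy matrix; the number of clusters `k = 2` is an assumption that would need to be chosen for real data:

```r
library(cluster)

# Toy similarity matrix in [0, 1] (hypothetical values for illustration).
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(letters[1:4], letters[1:4]))

# pam() works on dissimilarities, so convert first.
dis <- as.dist(1 - sim)

# diss = TRUE tells pam() that the input is a dissimilarity object
# rather than a data matrix of observations.
fit <- pam(dis, k = 2, diss = TRUE)

clusters <- fit$clustering  # cluster id for each item
medoids  <- fit$medoids     # the representative item of each cluster
```

Unlike a dendrogram, pam returns a flat partition, and each cluster is represented by an actual item (its medoid), which can be convenient when the items are strings.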
