
R: clustering with a similarity or dissimilarity matrix? And visualizing the results

Original post: 2017-07-12 15:06:28. Tags: r, matrix, similarity, dendrogram

I have a similarity matrix that I created using Harry, a tool for string similarity, and I want to plot some dendrograms from it to see whether I can find clusters or groups in the data. I'm using the following similarity measures:

Normalized compression distance (NCD)
Damerau-Levenshtein distance
Jaro-Winkler distance
Levenshtein distance
Optimal string alignment distance (OSA)

("For comparison Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output")

At first, since it was my first time using R, I didn't pay much attention to the documentation of hclust, so I used it with a similarity matrix. I know I should have used a dissimilarity matrix, and I know that, since my similarity matrix is normalized to [0,1], I could just compute dissimilarity = 1 - similarity and then use hclust.
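As a minimal sketch of that conversion, assuming a small made-up similarity matrix named sim (the strings and values are invented for illustration and are not output from Harry):

```r
# Toy similarity matrix in [0, 1]; labels and values are hypothetical.
labels <- c("abc", "abd", "xyz", "xyw")
sim <- matrix(c(1.0, 0.9, 0.1, 0.2,
                0.9, 1.0, 0.2, 0.1,
                0.1, 0.2, 1.0, 0.8,
                0.2, 0.1, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(labels, labels))

# hclust merges the pair with the SMALLEST value first, so it expects a
# dissimilarity: near-identical strings should map to values close to 0.
dissim <- 1 - sim

# Average-linkage clustering on the dissimilarities, then the dendrogram.
hc <- hclust(as.dist(dissim), method = "average")
plot(hc)

# Cut the tree into two groups (k = 2 is an arbitrary choice for this toy data).
groups <- cutree(hc, k = 2)
print(groups)
```

Because hclust always joins the smallest entries first, passing the similarity matrix directly makes it join the least similar strings first, which is worth keeping in mind when comparing the two dendrograms.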

But the groups that I get using hclust with a similarity matrix are much better than the ones I get using hclust with its corresponding dissimilarity matrix.

I tried the proxy package as well, and the same problem occurs: the groups that I get aren't what I expected.

To get the dendrogram from the similarity matrix, I do (1):

plot(hclust(as.dist(similarityMATRIX), "average"))

With the dissimilarity matrix, I tried (2):

plot(hclust(as.dist(dissimilarityMATRIX), "average"))

and (3):

plot(hclust(as.sim(dissimilarityMATRIX), "average"))

From (1) I get what I believe is a very good dendrogram, and I can get very good groups out of it. From (2) and (3) I get the same dendrogram, and the groups I can get out of it aren't as good as the ones I get from (1).

I say the groups are good or bad because at the moment I have a fairly small volume of data to analyse, so I can check them very easily.

Does what I'm getting make any sense? Is there something that justifies it? Any suggestions on how to cluster with a similarity matrix? Is there a better way to visualize a similarity matrix than a dendrogram?

1 answer

You can visualize a similarity matrix using a heatmap (for example, with the heatmaply R package). You can check how well a dendrogram fits by using the cor_cophenetic function from the dendextend R package (use the most recent version from GitHub).
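A rough sketch of both ideas using only base R, so it runs without extra packages: stats::heatmap stands in for heatmaply, and a plain cor() on the cophenetic distances stands in for dendextend's cor_cophenetic. The matrix sim is made-up toy data, and these substitutions are assumptions for illustration rather than the packages named above.

```r
# Toy similarity matrix in [0, 1]; labels and values are hypothetical.
labels <- c("abc", "abd", "xyz", "xyw")
sim <- matrix(c(1.0, 0.9, 0.1, 0.2,
                0.9, 1.0, 0.2, 0.1,
                0.1, 0.2, 1.0, 0.8,
                0.2, 0.1, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(labels, labels))

dissim <- as.dist(1 - sim)
hc <- hclust(dissim, method = "average")

# Heatmap of the similarity matrix, with rows and columns ordered by the tree.
heatmap(sim, Rowv = as.dendrogram(hc), symm = TRUE, scale = "none")

# Cophenetic correlation: how closely the tree's merge heights reproduce
# the original dissimilarities (values near 1 indicate a faithful tree).
coph <- cor(as.vector(dissim), as.vector(cophenetic(hc)))
print(coph)
```

Computing this correlation for trees built with different linkage methods, or from different distance measures, gives a number to compare alongside a visual check of the groups.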

Distance-based clustering can be done using hclust, but also using cluster::pam (k-medoids).
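For the k-medoids route, a minimal sketch might look like the following. It assumes the cluster package is installed, uses a made-up similarity matrix sim, and picks k = 2 arbitrarily for the toy data.

```r
library(cluster)

# Toy similarity matrix in [0, 1]; labels and values are hypothetical.
labels <- c("abc", "abd", "xyz", "xyw")
sim <- matrix(c(1.0, 0.9, 0.1, 0.2,
                0.9, 1.0, 0.2, 0.1,
                0.1, 0.2, 1.0, 0.8,
                0.2, 0.1, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(labels, labels))

# pam works on dissimilarities, so convert first.
dissim <- as.dist(1 - sim)

# diss = TRUE tells pam the input is already a dissimilarity object.
fit <- pam(dissim, k = 2, diss = TRUE)

print(fit$clustering)  # cluster index for each string
print(fit$medoids)     # the representative string of each cluster
```

Unlike a dendrogram, pam requires choosing the number of clusters up front, and each cluster is summarized by one actual string (its medoid), which can make small groups easy to inspect by hand.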