
Clustering on dissimilarity matrix in R

Asked 2017-09-06 22:21:44. Tags: r, cluster-analysis, k-means

I'm currently trying to get my head around unsupervised machine learning, i.e. clustering, and I'm a bit confused.

First of all, here is why I need a clustering algorithm. I computed an N x N dissimilarity matrix that compares the (dis)similarity of binary trees. Every diagonal entry N_i,i is zero, and every other entry N_i,j is ≥ 0. The matrix contains 100 x 100 elements, i.e. I have 100 binary trees that I compare with each other. The matrix is computed outside of R. The distances in it are tree edit distances, and they satisfy the triangle inequality.

Which clustering algorithms am I actually allowed to use with just this information? I'm pretty sure I can use hierarchical clustering, but how would I perform k-means or PAM clustering in R with just this matrix?

1 answer

You can't use k-means, because it needs to compute the means and the distances from those means. That won't work on trees.

HAC (hierarchical agglomerative clustering), PAM and DBSCAN are all fine. DBSCAN is the most scalable of the three, but it also works better when you have enough data, and your sample may be too small for it. So I'd use HAC.
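For illustration, here is a minimal sketch of how HAC and PAM could be run in R on a precomputed dissimilarity matrix. Everything about the data is an assumption: the random Manhattan distances only stand in for the real tree edit distances, the file name in the comment is hypothetical, and the average linkage and k = 4 are arbitrary choices, not recommendations.

```r
library(cluster)  # provides pam(); a "recommended" package included in most R installs

# Stand-in for the externally computed 100 x 100 matrix (hypothetical data):
# Manhattan distances between random points are symmetric, have a zero
# diagonal and satisfy the triangle inequality, like tree edit distances.
set.seed(1)
pts <- matrix(rnorm(200), ncol = 2)               # 100 random 2-D points
D <- as.matrix(dist(pts, method = "manhattan"))   # full 100 x 100 matrix

# With real data the matrix would be loaded instead, for example:
# D <- as.matrix(read.csv("tree_distances.csv", header = FALSE))  # hypothetical file

d <- as.dist(D)  # convert the square matrix into a 'dist' object

# Hierarchical agglomerative clustering (HAC) with average linkage
hc <- hclust(d, method = "average")
hac_labels <- cutree(hc, k = 4)   # cut the dendrogram into 4 clusters

# PAM (k-medoids) directly on the dissimilarities
pm <- pam(d, k = 4, diss = TRUE)
pam_labels <- pm$clustering       # cluster label for each tree
medoid_ids <- pm$id.med           # indices of the trees chosen as medoids
```

Calling plot(hc) draws the dendrogram, which can help when choosing where to cut. PAM works here because its cluster centers are medoids, i.e. actual trees from the data set, so no mean ever has to be computed.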