
What ways of assessing similarity of clusterings are there?

Posted 2020-03-30 14:29:57 · Tags: python, statistics, cluster-analysis, evaluation

Suppose I have two ways of clustering the same dataset and want to calculate the similarity of their outputs. I need something akin to a correlation, but cluster labels are a categorical variable. I thought about using chi-square, but it is not advised when multiple cells in the contingency table have counts below 5 (and this will happen often when the clusterings are very similar). Another idea was Fisher's exact test, but the Python scipy implementation works only for 2x2 contingency matrices, and I will likely be working with bigger ones (10x10 or 8x6, for example).

Are there any other established methods of comparing clusterings in this way? Are there any Python implementations of them?

1 answer

Excellent Python implementations exist at https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation. They cover both comparing a clustering result to its ground truth labels (external validation) and evaluating a clustering result on criteria such as the distance between cluster centroids (internal validation), and each measure has its own upsides and downsides. The contingency matrix gives excellent insight into your clustering, but it does not give a single numerical value to 'prove your clustering was good'.
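As a minimal sketch of how two clusterings of the same data could be compared with measures from that page: the label lists below are made up for illustration, and picking the adjusted Rand index (ARI) and adjusted mutual information (AMI) is an assumption about what fits the question, not something the answer singles out. Both scores are symmetric and ignore how the clusters happen to be numbered, so neither labeling needs to be a ground truth.

```python
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# Hypothetical labels from two clusterings of the same eight points.
labels_a = [0, 0, 0, 1, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 0, 0, 2, 2]

# Rows are the clusters of labels_a, columns are the clusters of labels_b.
table = contingency_matrix(labels_a, labels_b)

# Chance-corrected agreement scores; identical partitions score 1.0.
ari = adjusted_rand_score(labels_a, labels_b)
ami = adjusted_mutual_info_score(labels_a, labels_b)

print(table)
print(f"ARI = {ari:.3f}, AMI = {ami:.3f}")
```

Because both scores are corrected for chance, values near 0 suggest agreement no better than random labeling, which sidesteps the small-cell-count concern raised for chi-square.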

If your dataset is very large and has many dimensions, the internal validation measures might be prohibitively slow.
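One possible way to soften that cost, sketched here under assumptions: the silhouette score needs pairwise distances, and scikit-learn's `silhouette_score` accepts a `sample_size` argument that scores a random subset instead of every point. The synthetic data, the use of `KMeans`, and the subset size are all illustrative choices, and the result is only an approximation of the full score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic data, purely for illustration: two well-separated Gaussian blobs
# with 2000 points each in 20 dimensions.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(2000, 20)),
    rng.normal(loc=6.0, scale=1.0, size=(2000, 20)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Score a random subset of 500 points rather than all 4000.
score = silhouette_score(X, labels, sample_size=500, random_state=0)
print(f"approximate silhouette = {score:.3f}")
```

Fixing `random_state` keeps the subsample reproducible; repeating the call with a few different seeds would give a rough sense of how much the estimate varies.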