What ways of assessing similarity of clusterings are there?

Suppose I have two ways of clustering the same dataset and want to calculate the similarity of their outputs. I would need something akin to a correlation, but cluster labels are a categorical variable. I thought about using a chi-square test, but that is not advised when multiple cells in the contingency table have counts below 5 (which will happen often when the clusterings are very similar). Another option was Fisher's exact test, but the Python scipy implementation works only for 2x2 contingency matrices, and I will likely be working with bigger ones (10x10 or 8x6, for example).

Are there any other established methods of comparing clusterings in this way? Are there any Python implementations of them?

Excellent Python implementations exist in scikit-learn; see https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation . They cover both external measures, which compare a clustering result with its ground-truth labels, and internal measures, which evaluate a clustering result on criteria such as the distance between cluster centroids. Each has its own upsides and downsides. The contingency matrix gives excellent insight into your clustering, but does not give a single numerical value to 'prove your clustering was good'.
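To make the external case concrete, here is a minimal pure-Python sketch of one such measure, the adjusted Rand index, computed from the contingency table of two labelings. The function names and the two example labelings are made up for illustration; in practice you would more likely call scikit-learn's `adjusted_rand_score`, which computes the same kind of score.

```python
from collections import Counter
from math import comb


def contingency_table(labels_a, labels_b):
    """Count how many points fall into each (cluster in A, cluster in B) pair."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    return Counter(zip(labels_a, labels_b))


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index: 1.0 for identical partitions, about 0.0 for chance agreement."""
    table = contingency_table(labels_a, labels_b)
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: the partitions trivially agree

    # Pairs of points placed together in both clusterings.
    together_in_both = sum(comb(count, 2) for count in table.values())
    # Pairs of points placed together within each clustering separately.
    together_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_in_a * together_in_b / total_pairs
    maximum = (together_in_a + together_in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings are a single cluster
    return (together_in_both - expected) / (maximum - expected)


# Hypothetical labelings of the same eight points by two methods.
method_1 = [0, 0, 0, 1, 1, 1, 2, 2]
method_2 = ["b", "b", "b", "a", "a", "c", "c", "c"]

table = contingency_table(method_1, method_2)
score = adjusted_rand_index(method_1, method_2)
print(dict(table))
print(round(score, 3))
```

The score depends only on which points are grouped together, not on the label names, so the two methods do not need to number their clusters the same way or even produce the same number of clusters.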

If your dataset is very large and has many dimensions, the internal validation measures might be prohibitively slow.
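One possible workaround, sketched below under the assumption that an approximate score is acceptable, is to compute an internal measure on a random subsample. scikit-learn's `silhouette_score` accepts a `sample_size` argument for this. The synthetic data, the choice of KMeans, and the sample size of 500 are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: three well-separated blobs in 20 dimensions.
centers = rng.normal(scale=10.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(2000, 20)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The full silhouette needs pairwise distances between all points
# (quadratic in the number of samples); sample_size scores a random subset.
approx_score = silhouette_score(X, labels, sample_size=500, random_state=0)
print(round(approx_score, 3))
```

The subsampled value is an estimate, so it will vary somewhat with `random_state` and the sample size.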
