
KMeans with huge number of clusters

I have a relatively big graph with around 6000 vertices, and I have to use KMeans to find its 5467 clusters. I have to use a different metric, which is why I pass the distance_matrix as input. The problem is that n_clusters is so large that the algorithm doesn't converge. I was advised to make custom adaptations in order to make it work, but I'm not really sure what that means. That's why I am posting this question here. Any help is welcome. Thank you! Here is my code:

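import networkx as nx
from sklearn.cluster import KMeans

# G is the graph (around 6000 vertices); compute all-pairs shortest-path distances
distance_matrix = nx.floyd_warshall_numpy(G)

cluster = KMeans(n_clusters=5467)

# Note: KMeans treats each row of distance_matrix as a feature vector,
# not as precomputed pairwise distances
cluster.fit(distance_matrix)

graph_labels = cluster.labels_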

I would not advise using such a high number of clusters with KMeans. Instead, try agglomerative clustering with Euclidean distance. This lets you find a cutoff at which grouping the points yields the expected number of clusters.


On a dendrogram, for example, cutting it off at a height of 5 would give you 4 clusters, while cutting it off at 2 would give you more.
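The cutoff idea can be sketched with scikit-learn's `distance_threshold` parameter, which stops merging once the linkage distance reaches the threshold. This is only a minimal illustration: the five points in `X`, the helper `count_clusters`, and the choice of single linkage are assumptions made for the example, not part of the original question.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two tight pairs of points and one far-away point
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0], [50.0, 0.0]])


def count_clusters(threshold):
    """Cut the hierarchy at `threshold` and return how many clusters remain."""
    model = AgglomerativeClustering(
        n_clusters=None,               # let the threshold decide the count
        distance_threshold=threshold,  # merges at or above this distance are not made
        linkage="single",              # assumption: single linkage, Euclidean distance
    )
    return model.fit(X).n_clusters_


low = count_clusters(0.5)    # below every pairwise distance: nothing merges
mid = count_clusters(5.0)    # only the two tight pairs merge
high = count_clusters(20.0)  # the two pairs also merge with each other
print(low, mid, high)
```

A lower cutoff leaves more clusters and a higher cutoff leaves fewer. If the target count is already known, as with the 5467 clusters in the question, passing `n_clusters` directly avoids searching for a threshold.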

Dummy code:

from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Six 2-D points forming two groups (x = 1 and x = 4)
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Default settings: n_clusters=2, Ward linkage
clustering = AgglomerativeClustering().fit(X)
print(clustering.labels_)
# array([1, 1, 1, 0, 0, 0])

You can also use a precomputed distance matrix with agglomerative clustering.

Check the documentation link that I have shared.

