How to extract and map cluster indices from sklearn.cluster.KMeans?

I have a map of data:

import seaborn as sns
import matplotlib.pyplot as plt

X = 101_by_99_float32_array
ax = sns.heatmap(X, square = True)
plt.show()

(Figure: intensity map)

Note these data are essentially a 3D surface, and I'm interested in the index positions in X after clustering. I can easily apply the kmeans algorithm to my data:

from sklearn.cluster import KMeans
# three clusters is arbitrary; just used for testing purposes
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10).fit(X)

But I am not sure how to navigate the kmeans output in a way that identifies which cluster each pixel in the map above belongs to. What I want to do is make a map that looks like the one above, but instead of plotting the z-value for each cell in the 101x99 array X, I'd like to plot the cluster number for each cell in X.

I don't know if this is possible with the output of the kmeans algorithm, but I did try an approach from the scikit-learn documentation:

import numpy as np
k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_
k_means_labels_unique = np.unique(k_means_labels)

colors = ['#4EACC5', '#FF9C34', '#4E9A06']
plt.figure()
#plt.hold(True)
for k, col in zip(range(3), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
plt.title('KMeans')    
plt.show()


But it's clear this is not accessing the information I want...

It's obvious I do not fully understand what each component of the kmeans output represents, and I've tried to read the explanations in the answer to the question found here. However, there's nothing in that answer that explicitly addresses whether the indices of the original data are preserved after clustering, which is really the core of my question. If such information is implicitly present in kmeans through some matrix multiplication, I could really use some help extracting it.

Thank you for your time and assistance!

EDIT:

Thanks to @Nakor for both the explanation about kmeans and the suggestion to reshape my data. How kmeans interprets my data is now much clearer. I should not expect it to capture the indices of each sample, but instead rely on reshape to do so. reshape will ravel the original (101,99) matrix into a (9999,1) array which, as @Nakor pointed out, is suitable for clustering every entry as an individual sample.

Simply reapply reshape to kmeans.labels_ using the original shape of the data, and I get the result I'm looking for:

Y = X.reshape(-1, 1) # shape data to cluster each individual entry 

kmeans= KMeans(init='k-means++', n_clusters=3, n_init=10)
kmeans.fit(Y)

Z = kmeans.labels_
A = Z.reshape(101,99)

plt.figure()
ax = sns.heatmap(X, square = True)  # original map, for comparison
plt.figure()
ay = sns.heatmap(A, square = True)

(Figure: final result)
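Since the question is about index positions, here is a minimal sketch of how the (row, col) indices of each cluster could be pulled out of the reshaped label map. The random array is only a stand-in for the real 101x99 map, and `cluster_indices` and `random_state=0` are names/settings introduced for illustration, not part of the original post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the real 101x99 float32 map (assumption: random values)
rng = np.random.default_rng(0)
X = rng.random((101, 99)).astype(np.float32)

kmeans = KMeans(init='k-means++', n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X.reshape(-1, 1))          # one sample per cell
A = kmeans.labels_.reshape(X.shape)   # label map with the same shape as X

# (row, col) index pairs of every cell assigned to each cluster
cluster_indices = {k: np.argwhere(A == k) for k in range(kmeans.n_clusters)}

for k, idx in cluster_indices.items():
    values = X[A == k]  # original z-values of the cells in cluster k
    print(k, idx.shape, values.min(), values.max())
```

Each entry of `cluster_indices` is an array with one (row, col) pair per cell, so it can be used directly to index back into X.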

Your issue is that sklearn.cluster.KMeans expects a 2D matrix of shape [N_samples, N_features]. However, you provide the raw image, so sklearn understands you have 101 samples with 99 features each (each row of your image is a sample, and the columns are the features). As a result, what you get in k_means.labels_ is the cluster assignment of each of the rows.
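To make the row-wise interpretation concrete, here is a small sketch on randomly generated data of the same shape (the array contents and `random_state=0` are assumptions for illustration only). Fitting on the 2D array directly yields one label per row and centers that live in a 99-dimensional feature space.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the real map (assumption: random values, same shape)
rng = np.random.default_rng(0)
X = rng.random((101, 99)).astype(np.float32)

# Each of the 101 rows is treated as one sample with 99 features
row_model = KMeans(init='k-means++', n_clusters=3, n_init=10, random_state=0).fit(X)

print(row_model.labels_.shape)           # (101,)  one label per row
print(row_model.cluster_centers_.shape)  # (3, 99) one 99-dimensional center per cluster
```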

If you instead want to cluster every single entry, you need to reshape your data, for instance like this:

model = KMeans(init='k-means++', n_clusters=3, n_init=10)
model.fit(X.reshape(-1,1))

If I check with randomly generated data, I get:

In [1]: len(model.labels_)
Out[1]: 9999

I have one label per entry.
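On whether the original indices are preserved: KMeans does not store them, but labels_ comes back in the same order as the rows passed to fit, and reshape flattens in row-major (C) order by default. A rough sketch of checking this on random stand-in data follows; `flat_index` and `label_map` are illustrative names, and the specific index 1234 is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the real map (assumption: random values, same shape)
rng = np.random.default_rng(0)
X = rng.random((101, 99)).astype(np.float32)

model = KMeans(init='k-means++', n_clusters=3, n_init=10, random_state=0)
model.fit(X.reshape(-1, 1))

# Map a position in the flattened data back to (row, col) in X
flat_index = 1234
row, col = np.unravel_index(flat_index, X.shape)

label_map = model.labels_.reshape(X.shape)

# Both comparisons print True: same cell, same label
print(X[row, col] == X.reshape(-1, 1)[flat_index, 0])
print(label_map[row, col] == model.labels_[flat_index])
```

So reshaping labels_ back to X.shape is enough to line every label up with its original cell.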
