How to extract and map cluster indices from sklearn.cluster.KMeans?

I have a map of data:

import seaborn as sns
import matplotlib.pyplot as plt

X = 101_by_99_float32_array
ax = sns.heatmap(X, square = True)
plt.show()

[Image: intensity map]

Note these data are essentially a 3D surface, and I'm interested in the index positions in X after clustering. I can easily apply the kmeans algorithm to my data:

from sklearn.cluster import KMeans
# three clusters is arbitrary; just used for testing purposes
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10).fit(X)

But I am not sure how to navigate the kmeans output in a way that identifies which cluster a pixel in the map above belongs to. What I want to do is make a map that looks like the one above, but instead of plotting the z-value for each cell in the 101x99 array X, I'd like to plot the cluster number for each cell in X.

I don't know if this is possible with the output of the kmeans algorithm, but I did try an approach from the scikit-learn documentation:

import numpy as np
k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_
k_means_labels_unique = np.unique(k_means_labels)

colors = ['#4EACC5', '#FF9C34', '#4E9A06']
plt.figure()
#plt.hold(True)
for k, col in zip(range(3), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
plt.title('KMeans')    
plt.show()

[Image: plot produced by the code above]

But it's clear this is not accessing the information I want...

It's obvious I do not fully understand what each component of the kmeans output represents, and I've tried to read the explanations in the answer to a related question. However, nothing in that answer explicitly addresses whether the indices of the original data are preserved after clustering, which is really the core of my question. If such information is implicitly present in kmeans through some matrix multiplication, I could really use some help extracting it.

Thank you for your time and assistance!

EDIT :

Thanks to @Nakor for both the explanation about kmeans and the suggestion to reshape my data. How kmeans interprets my data is now much clearer. I should not expect it to capture the indices of each sample, but should instead rely on reshape to do so. reshape ravels the original (101, 99) matrix into a (9999, 1) array which, as @Nakor pointed out, is suitable for clustering every entry as an individual sample.
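The round trip between grid position and sample index can be checked on a small array. This is only a minimal sketch: the 4x3 array X_small and the sample index i are hypothetical stand-ins chosen for readability, and it assumes NumPy's default row-major (C) order for reshape.

```python
import numpy as np

# Hypothetical stand-in for the 101x99 map: a 4x3 array with distinct values
X_small = np.arange(12, dtype=np.float32).reshape(4, 3)

# One sample per cell, one feature per sample
Y_small = X_small.reshape(-1, 1)
print(Y_small.shape)                        # (12, 1)

# Recover the (row, col) position that flat sample i came from
i = 7
row, col = np.unravel_index(i, X_small.shape)
print(row, col)                             # 2 1
print(Y_small[i, 0] == X_small[row, col])   # True

# Reshaping back with the original shape restores the original layout
print(np.array_equal(Y_small.reshape(X_small.shape), X_small))  # True
```

Because the order is the same in both directions, anything computed per sample (such as a cluster label) can be reshaped back onto the grid in the same way.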

Simply reapplying reshape to kmeans.labels_ with the original shape of the data gives the result I was looking for:

Y = X.reshape(-1, 1) # shape data to cluster each individual entry 

kmeans= KMeans(init='k-means++', n_clusters=3, n_init=10)
kmeans.fit(Y)

Z = kmeans.labels_
A = Z.reshape(101,99)

plt.figure()
ax = sns.heatmap(X, square = True)  # original data
plt.figure()
ay = sns.heatmap(A, square = True)  # cluster label of each cell
plt.show()

[Image: final result]

Your issue is that sklearn.cluster.KMeans expects a 2D matrix of shape [N_samples, N_features]. However, you provide the raw image, so sklearn understands that you have 101 samples with 99 features each (each row of your image is a sample, and the columns are the features). As a result, what you get in k_means.labels_ is the cluster assignment of each row.
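One way to see this is to fit on an array of the same shape and look at the shapes of the fitted attributes. The sketch below uses synthetic data; X_demo, the seed, and random_state=0 are assumptions for reproducibility and are not part of the original post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the 101x99 map (assumption: uniform random values)
rng = np.random.default_rng(0)
X_demo = rng.random((101, 99)).astype(np.float32)

# Fitting on the 2D array treats each of the 101 rows as one sample
row_model = KMeans(init='k-means++', n_clusters=3, n_init=10,
                   random_state=0).fit(X_demo)

print(row_model.labels_.shape)            # (101,): one label per row
print(row_model.cluster_centers_.shape)   # (3, 99): one 99-feature center per cluster
```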

If you want instead to cluster every single entry, you need to reshape your data, for instance like this:

model = KMeans(init='k-means++', n_clusters=3, n_init=10)
model.fit(X.reshape(-1,1))

If I check with randomly generated data, I get:

In [1]: len(model.labels_)
Out[1]: 9999

I have one label per entry.
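From there, the labels can be put back on the grid and queried by position. The following is a sketch under stated assumptions: X_demo is synthetic, label_map and positions are illustrative names, and cluster 0 is an arbitrary choice (KMeans label numbers carry no inherent meaning or ordering).

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the 101x99 map (assumption: uniform random values)
rng = np.random.default_rng(0)
X_demo = rng.random((101, 99)).astype(np.float32)

# Cluster every cell as its own one-feature sample
model = KMeans(init='k-means++', n_clusters=3, n_init=10, random_state=0)
model.fit(X_demo.reshape(-1, 1))

# Put the flat labels back on the grid: label_map[r, c] is the cluster of X_demo[r, c]
label_map = model.labels_.reshape(X_demo.shape)
print(label_map.shape)        # (101, 99)

# (row, col) index pairs of every cell assigned to cluster 0
positions = np.argwhere(label_map == 0)
print(positions.shape[1])     # 2: each entry is a (row, col) pair

# Number of cells in each cluster; the counts add up to 101 * 99
counts = np.bincount(model.labels_, minlength=3)
print(counts.sum())           # 9999
```

label_map can be passed to sns.heatmap directly to draw the cluster map.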
