
Cluster datapoints using kmeans sklearn in python

I am using the following Python code to cluster my data points with k-means.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

data = np.array([[30, 17, 10, 32, 32], [18, 20, 6, 20, 15], [10, 8, 10, 20, 21], [3, 16, 20, 10, 17], [3, 15, 21, 17, 20]])
kmeans_clustering = KMeans(n_clusters=3)
idx = kmeans_clustering.fit_predict(data)

# use t-SNE to project the 5-dimensional data down to 2D for plotting
X = TSNE(n_components=2).fit_transform(data)

fig = plt.figure(1)
plt.clf()

# plot the 2D embedding, colored by cluster label
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
plt.scatter(X[:,0], X[:,1], c=colors[kmeans_clustering.labels_])
plt.title('K-Means (t-SNE)')
plt.show()

However, the resulting cluster plot is wrong: all the points collapse onto a single spot.

Please let me know where my code is going wrong. I want to see the k-means clusters separated in my scatter plot.

EDIT

The t-SNE values I get are as follows:

[[  1.12758575e-04   9.30458337e-05]
 [ -1.82559784e-04  -1.06657936e-04]
 [ -9.56485652e-05  -2.38951623e-04]
 [  5.56515580e-05  -4.42453191e-07]
 [ -1.42039677e-04  -5.62548119e-05]]

Use the perplexity parameter of TSNE. The default perplexity is 30, which seems to be far too large for your case (you only have 5 samples), even though the documentation states that t-SNE is quite insensitive to this parameter.

The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is not extremely critical since t-SNE is quite insensitive to this parameter.

X = TSNE(n_components=2, perplexity=2.0).fit_transform(data)
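Putting the fix together, here is a minimal runnable sketch (assuming scikit-learn; `random_state` and `n_init` are added only for reproducibility and are not in the original code). Note that recent scikit-learn versions require `perplexity` to be strictly less than the number of samples, so the default of 30 would not even run on 5 points:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

data = np.array([[30, 17, 10, 32, 32], [18, 20, 6, 20, 15],
                 [10, 8, 10, 20, 21], [3, 16, 20, 10, 17],
                 [3, 15, 21, 17, 20]])

# cluster in the original 5-dimensional space
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)

# project to 2D for plotting; perplexity=2.0 < n_samples=5,
# so the embedding no longer degenerates to a single point
X = TSNE(n_components=2, perplexity=2.0,
         random_state=0).fit_transform(data.astype(np.float64))
```

`X` can then be passed straight to `plt.scatter(X[:, 0], X[:, 1], c=labels)` as in the question.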


You could also use PCA (Principal Component Analysis) instead of t-SNE to plot your clusters:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

data = np.array([[30, 17, 10, 32, 32], [18, 20, 6, 20, 15], [10, 8, 10, 20, 21], [3, 16, 20, 10, 17], [3, 15, 21, 17, 20]])
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(data)

pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)
data_reduced = pd.DataFrame(data_reduced)

ax = data_reduced.plot(kind='scatter', x=0, y=1, c=labels, cmap='rainbow')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('Projection of the clustering onto the PCA axes')

for x, y, label in zip(data_reduced[0], data_reduced[1], kmeans.labels_):
    ax.annotate('Cluster {0}'.format(label), (x,y))
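As a quick sanity check on the PCA plot (a sketch reusing the same data array; the `retained` name is my own), `explained_variance_ratio_` tells you how much of the original variance the two plotted components keep, i.e. how faithful the 2D scatter is to the 5-dimensional clusters:

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[30, 17, 10, 32, 32], [18, 20, 6, 20, 15],
                 [10, 8, 10, 20, 21], [3, 16, 20, 10, 17],
                 [3, 15, 21, 17, 20]])

pca = PCA(n_components=2)
pca.fit(data)

# fraction of the total variance captured by PC1 and PC2 together;
# the closer to 1.0, the more trustworthy the 2D projection
retained = pca.explained_variance_ratio_.sum()
```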

