
Cluster datapoints using kmeans sklearn in python

I am using the following Python code to cluster my data points with k-means.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

data = np.array([[30, 17, 10, 32, 32], [18, 20, 6, 20, 15], [10, 8, 10, 20, 21], [3, 16, 20, 10, 17], [3, 15, 21, 17, 20]])
kmeans_clustering = KMeans(n_clusters=3)
idx = kmeans_clustering.fit_predict(data)

# use t-SNE to project the 5-dimensional data down to 2D for plotting
X = TSNE(n_components=2).fit_transform(data)

fig = plt.figure(1)
plt.clf()

# plot the 2D embedding, colored by cluster label
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
plt.scatter(X[:,0], X[:,1], c=colors[kmeans_clustering.labels_])
plt.title('K-Means (t-SNE)')
plt.show()

However, the resulting cluster plot is wrong: all the points collapse onto a single spot.

Please let me know where my code is going wrong. I want to see the k-means clusters separated in my scatter plot.

EDIT

The t-SNE values I get are as follows:

[[  1.12758575e-04   9.30458337e-05]
 [ -1.82559784e-04  -1.06657936e-04]
 [ -9.56485652e-05  -2.38951623e-04]
 [  5.56515580e-05  -4.42453191e-07]
 [ -1.42039677e-04  -5.62548119e-05]]

Use the perplexity parameter of TSNE. The default perplexity is 30, which seems to be far too large for your case (you only have 5 samples), even though the documentation states that t-SNE is quite insensitive to this parameter.

The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is not extremely critical since t-SNE is quite insensitive to this parameter.

X = TSNE(n_components=2, perplexity=2.0).fit_transform(data)
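Putting the fix together, here is a minimal runnable sketch (assuming scikit-learn; `random_state` and `n_init` are added only for reproducibility and are not in the original code). Note that recent scikit-learn versions require `perplexity` to be strictly less than the number of samples, so the default of 30 would not even run on 5 points:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

data = np.array([[30, 17, 10, 32, 32], [18, 20, 6, 20, 15],
                 [10, 8, 10, 20, 21], [3, 16, 20, 10, 17],
                 [3, 15, 21, 17, 20]])

# cluster in the original 5-dimensional space
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)

# project to 2D for plotting; perplexity=2.0 < n_samples=5,
# so the embedding no longer degenerates to a single point
X = TSNE(n_components=2, perplexity=2.0,
         random_state=0).fit_transform(data.astype(np.float64))
```

`X` can then be passed straight to `plt.scatter(X[:, 0], X[:, 1], c=labels)` as in the question.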


You could also use PCA (Principal Component Analysis) instead of t-SNE to plot your clusters:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

data = np.array([[30, 17, 10, 32, 32], [18, 20, 6, 20, 15], [10, 8, 10, 20, 21], [3, 16, 20, 10, 17], [3, 15, 21, 17, 20]])
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(data)

pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)
data_reduced = pd.DataFrame(data_reduced)

ax = data_reduced.plot(kind='scatter', x=0, y=1, c=labels, cmap='rainbow')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('Projection of the clustering onto the PCA axes')

for x, y, label in zip(data_reduced[0], data_reduced[1], kmeans.labels_):
    ax.annotate('Cluster {0}'.format(label), (x,y))
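As a quick sanity check on the PCA plot (a sketch reusing the same data array; the `retained` name is my own), `explained_variance_ratio_` tells you how much of the original variance the two plotted components keep, i.e. how faithful the 2D scatter is to the 5-dimensional clusters:

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[30, 17, 10, 32, 32], [18, 20, 6, 20, 15],
                 [10, 8, 10, 20, 21], [3, 16, 20, 10, 17],
                 [3, 15, 21, 17, 20]])

pca = PCA(n_components=2)
pca.fit(data)

# fraction of the total variance captured by PC1 and PC2 together;
# the closer to 1.0, the more trustworthy the 2D projection
retained = pca.explained_variance_ratio_.sum()
```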

