sklearn: silhouette score different for same clustering

Question

I modified the Cluster comparison script to compute silhouette_score on the clustering output.

I added this line:

sil = silhouette_score(X, y_pred, metric='euclidean') if len(np.unique(y_pred)) > 1 else float('NaN')

and modified the plt.text() line for showing the sil value in the subplot:

txt = 'sil={:.3f}\n{:.2f}s'.format(sil,(t1 - t0))
plt.text(.99, .01, txt, transform=plt.gca().transAxes, size=15, horizontalalignment='right')

This is what I get:

Look at the 3rd row, columns MeanShift and DBSCAN. The clustering looks the same, but the silhouette score is significantly lower for DBSCAN. How come?

(Since this question is not about a programming error, should it be moved to stats?)

Answer 1

In short, the clusterings aren't the same. If you look very closely at the DBSCAN plot, you'll see that there is an outlier at the bottom left of the blue cluster that is not assigned to any cluster -- it appears as a black point.

Note that the silhouette score assumes that all points are assigned to a cluster, so it may not give the answer you'd expect. In this case, the single point not assigned to any cluster is enough to make a significant difference in the silhouette scores.
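To make this concrete, here is a small sketch on synthetic data (not the data from the question's plot). DBSCAN marks unassigned points with the label -1, and silhouette_score treats -1 like any other label, so a single noise point becomes a one-point cluster. That one-point cluster then sits right next to a real cluster and becomes its nearest neighbouring cluster, which pulls down the score of every point in it. The blob positions, the outlier location, and the variable names below are all made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption for illustration): two tight blobs
# plus one outlier sitting just outside the first blob.
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
outlier = np.array([[-1.5, -1.5]])
X = np.vstack([blob_a, outlier, blob_b])

# MeanShift-like labelling: the outlier belongs to the first cluster.
labels_full = np.array([0] * 51 + [1] * 50)

# DBSCAN-like labelling: the outlier is noise, which DBSCAN labels -1.
labels_noise = labels_full.copy()
labels_noise[50] = -1

sil_full = silhouette_score(X, labels_full, metric='euclidean')
sil_noise = silhouette_score(X, labels_noise, metric='euclidean')

# One possible workaround: score only the points that were assigned
# to a cluster. Note this changes what is being measured.
mask = labels_noise != -1
sil_masked = silhouette_score(X[mask], labels_noise[mask], metric='euclidean')

print('outlier in cluster 0: {:.3f}'.format(sil_full))
print('outlier as noise -1:  {:.3f}'.format(sil_noise))
print('noise points dropped: {:.3f}'.format(sil_masked))
```

Whether dropping the noise points before scoring is appropriate depends on what you want to compare; it makes the DBSCAN score comparable to the others only for the points that were actually clustered.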

1 answer

solution1
2 ACCEPTED 2015-08-01 15:44:21
