sklearn: silhouette score different for same clustering

I modified the Cluster comparison script to compute silhouette_score on the clustering output.

I added this line:

sil = silhouette_score(X, y_pred, metric='euclidean') if len(np.unique(y_pred)) > 1 else float('NaN')

and modified the plt.text() line for showing the sil value in the subplot:

txt = 'sil={:.3f}\n{:.2f}s'.format(sil,(t1 - t0))
plt.text(.99, .01, txt, transform=plt.gca().transAxes, size=15, horizontalalignment='right')

This is what I get:

[Image: cluster comparison plot with the silhouette score shown in each subplot]

Look at the 3rd row, columns MeanShift and DBSCAN. The clustering is the same, but the silhouette score is significantly lower for DBSCAN. How come?

Since this question is not about a programming error, should it be moved to stats?

In short, the clusterings aren't the same. If you look very closely at the DBSCAN plot, you'll see that there is an outlier at the bottom left of the blue cluster that is not assigned to any cluster -- it appears as a black point.
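
One way to confirm this without squinting at the plot is to count the noise labels and compare the two label arrays directly. The sketch below uses small made-up label arrays (`labels_meanshift` and `labels_dbscan` are hypothetical stand-ins, not the output of the script in the question) and relies on scikit-learn's convention that DBSCAN marks noise points with the label -1.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical label arrays standing in for the two clustering outputs.
labels_meanshift = np.array([0, 0, 0, 0, 1, 1, 1, 1])
labels_dbscan = np.array([0, 0, 0, -1, 1, 1, 1, 1])  # -1 marks a noise point

# Number of points DBSCAN left unassigned.
n_noise = int(np.sum(labels_dbscan == -1))

# The adjusted Rand index is 1.0 only when the two partitions are identical
# (up to a renaming of the labels).
ari = adjusted_rand_score(labels_meanshift, labels_dbscan)

print("noise points:", n_noise)
print("adjusted Rand index:", ari)
```

In the real script, the same two checks could be run on the `y_pred` arrays produced by each algorithm for the dataset in question.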

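Note that the silhouette score assumes that every point is assigned to a cluster, so it may not give the answer you'd expect when some points are left as noise. scikit-learn's DBSCAN labels noise points -1, and silhouette_score treats that label like any other cluster, so the outlier is scored as a one-point cluster sitting right next to the blue one. In this case, that single unassigned point is enough to make a significant difference in the silhouette scores.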
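
The effect can be reproduced on synthetic data. The following is a minimal sketch, not the code from the question: the blobs, the outlier position, and all variable names are made up for illustration. It scores the same points three ways: with the outlier assigned to the nearby cluster, with the outlier labelled -1 as DBSCAN would do, and with the noise point masked out before scoring. Masking is only one possible workaround, and whether it is appropriate depends on what you want the score to measure.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)

# Two well-separated blobs plus one stray point near the first blob
# (synthetic data, chosen only to illustrate the effect).
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
outlier = np.array([[-1.5, -1.5]])
X = np.vstack([blob_a, blob_b, outlier])

# Labelling 1: the stray point belongs to the first cluster.
labels_full = np.array([0] * 50 + [1] * 50 + [0])
# Labelling 2: the stray point is noise (-1), as DBSCAN would label it.
labels_noise = np.array([0] * 50 + [1] * 50 + [-1])

sil_full = silhouette_score(X, labels_full, metric='euclidean')
# Here -1 is treated as a third, single-point cluster.
sil_noise = silhouette_score(X, labels_noise, metric='euclidean')

# Possible workaround (an assumption, not part of the original answer):
# drop the noise points before scoring.
mask = labels_noise != -1
sil_masked = silhouette_score(X[mask], labels_noise[mask], metric='euclidean')

print("outlier assigned to a cluster:", sil_full)
print("outlier labelled -1:", sil_noise)
print("noise masked out:", sil_masked)
```

With this data the score for the -1 labelling comes out lower than the other two, because every point in the first blob now has a "neighbouring cluster" (the lone outlier) that is much closer than the second blob.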
