
Python Kmeans cluster order for classification

I am practicing with machine learning classification algorithms. I am comparing different methods for assigning a category: logistic regression, KNN, and predictions based on KMeans centroids.

Everything worked perfectly, except that KMeans inverted the labels 0 and 1. The clustering itself is still correct, but the categories no longer correspond. The confusion matrix is therefore swapped between True and False, and my accuracy score is 1% instead of 99%.

Cluster 0 has to correspond to False and cluster 1 to True. In addition, the True samples outnumber the False ones in this dataset, but that may not hold in another one.

Is there any way to fix the labels beforehand, or to reassign the KMeans cluster labels afterwards?

I don't have this issue with KNN or logistic regression, whose categories correspond correctly to 0 and 1.

Here is my code, for a DataFrame of 1500 rows and 6 columns, which predicts a category of 0 or 1 (False or True):

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import metrics
from sklearn.cluster import KMeans

# KMeans model initialization and fitting
km = KMeans(n_clusters=2)
km.fit(X_train_std)

# cluster centroids as a DataFrame (one row per cluster)
centroid = km.cluster_centers_
c_km = pd.DataFrame(centroid, columns=X_name)

# predict the cluster (0 or 1) of each test sample
y_pred_km = km.predict(X_test_std)

# store the predictions; cluster 1 is read as True, cluster 0 as False
pred['pred_km'] = y_pred_km
pred['is_genuine_km'] = pred['pred_km'] > 0

# plot the confusion matrix & accuracy score
fig, ax = plt.subplots(1, 1)
cm_km = metrics.confusion_matrix(y_test, y_pred_km)
cm_display_km = metrics.ConfusionMatrixDisplay(cm_km, display_labels=['False', 'True'])
cm_display_km.plot(ax=ax)
ax.set_title('K-Means Confusion Matrix \n Accuracy = %0.3f' % metrics.accuracy_score(y_test, y_pred_km))
plt.show()

I assume you are using scikit-learn. In that case, you can seed the random number generator by passing random_state when you create the model, e.g. km = KMeans(n_clusters=2, random_state=42), so that it produces the same clustering on every run.

See KMeans documentation for the random_state parameter:

Use an int to make the randomness deterministic.
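
Note that a fixed seed makes the cluster numbering reproducible, but the numbers themselves stay arbitrary: nothing ties cluster 0 to False. If labelled training data is available, one possible way to reassign the labels is a majority vote, where each cluster id is mapped to the most frequent true label among its members. The sketch below is only an illustration under that assumption. The helper names build_cluster_to_class_map and relabel are hypothetical (they are not part of scikit-learn), and the toy lists stand in for the output of km.predict.

```python
from collections import Counter, defaultdict


def build_cluster_to_class_map(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label among its members."""
    votes = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, true_labels):
        # int() also accepts NumPy integers and booleans (True -> 1)
        votes[int(cluster_id)][int(label)] += 1
    # most_common(1) returns [(label, count)] for the majority label
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}


def relabel(cluster_ids, mapping):
    """Translate raw cluster ids into class labels using the mapping."""
    return [mapping[int(cid)] for cid in cluster_ids]


# Toy data standing in for km.predict(X_train_std) and the training labels:
# the clusters are clean, but their ids are inverted relative to the classes.
train_clusters = [1, 1, 0, 0, 0, 1, 0]
train_labels = [0, 0, 1, 1, 1, 0, 1]

mapping = build_cluster_to_class_map(train_clusters, train_labels)

# Stand-in for km.predict(X_test_std)
test_clusters = [0, 1, 1, 0]
aligned = relabel(test_clusters, mapping)
```

With the variables from the question, and assuming the training labels are available as y_train (a name that does not appear in the original code), this could be applied as mapping = build_cluster_to_class_map(km.predict(X_train_std), y_train) followed by y_pred_km = relabel(km.predict(X_test_std), mapping). If the clustering separates the classes poorly, both clusters may vote for the same class, so it is worth checking that the mapping covers both 0 and 1 before trusting the accuracy score.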

