Why is the cross_val_score result different from the one I calculate manually?

Here is the reproducible example code:

from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

# define dataset
X, y = make_classification(n_samples=1000, weights = [0.3,0.7], n_features=100, n_informative=75, random_state=0)
# define the model
model = RandomForestClassifier(n_estimators=10, random_state=0)
# evaluate the model
n_splits=10
cv = StratifiedShuffleSplit(n_splits, random_state=0)
n_scores = cross_validate(model, X, y, scoring='balanced_accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %0.4f' % (mean(n_scores['test_score'])))

bal_acc_sum = []
for train_index, test_index in cv.split(X, y):
    model.fit(X[train_index], y[train_index])
    bal_acc_sum.append(balanced_accuracy_score(model.predict(X[test_index]),y[test_index]))

print("Accuracy: %0.4f" % (mean(bal_acc_sum)))

Result:

Accuracy: 0.6737
Accuracy: 0.7113

My manually calculated score is always higher than the one cross-validation gives me. Shouldn't they be the same, or am I missing something? I use the same metric, the same splits (KFold shows the same behavior), the same fixed model (other models behave identically), and the same random state, yet the results differ.

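This happens because your manual calculation passes the arguments to balanced_accuracy_score in the wrong order, and the order matters: the signature is balanced_accuracy_score(y_true, y_pred) (see the scikit-learn docs). Balanced accuracy is the average of the per-class recall, which is computed relative to the true labels, so swapping the two arguments changes the result.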
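To see that the argument order really matters, here is a small sketch on made-up toy labels (not the data from the question). It only assumes that balanced_accuracy_score averages the per-class recall, which is how scikit-learn documents it; the variable names are illustrative.

```python
from sklearn.metrics import balanced_accuracy_score

# Toy labels, chosen only so the arithmetic is easy to follow by hand.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]

# Correct order: recall of class 0 is 2/3, recall of class 1 is 1/1.
correct = balanced_accuracy_score(y_true, y_pred)

# Swapped order: the predictions are treated as the ground truth, so the
# per-class recalls become 2/2 and 1/2 instead.
swapped = balanced_accuracy_score(y_pred, y_true)

print(round(correct, 4))  # 0.8333
print(round(swapped, 4))  # 0.75
```

With the arguments swapped, the function effectively averages the per-class precision rather than the recall, which is why the manual loop in the question reports a different number.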

Changing this, your manual calculation becomes:

bal_acc_sum = []
for train_index, test_index in cv.split(X, y):
    model.fit(X[train_index], y[train_index])
    bal_acc_sum.append(balanced_accuracy_score(y[test_index], model.predict(X[test_index])))  # change order of arguments here: y_true first, then y_pred

print("Accuracy: %0.4f" % (mean(bal_acc_sum)))

Result:

Accuracy: 0.6737

The per-fold scores are now also identical to those from cross_validate:

import numpy as np
np.all(bal_acc_sum==n_scores['test_score'])
# True
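One possible way to avoid the argument-order pitfall in a manual loop is to score through the scorer object that scikit-learn resolves from the string 'balanced_accuracy', since a scorer is called as scorer(estimator, X, y) and handles the metric call itself. The sketch below is not part of the original answer; it reuses the setup from the question, and the names scorer, fitted, manual_scores and cv_scores are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import get_scorer
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate

# Same dataset, model and splitter as in the question.
X, y = make_classification(n_samples=1000, weights=[0.3, 0.7], n_features=100,
                           n_informative=75, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0)
cv = StratifiedShuffleSplit(n_splits=10, random_state=0)

# Reference scores from cross_validate.
cv_scores = cross_validate(model, X, y, scoring='balanced_accuracy', cv=cv)['test_score']

# Manual loop using the scorer for the same scoring string.
scorer = get_scorer('balanced_accuracy')
manual_scores = []
for train_index, test_index in cv.split(X, y):
    # Fit a fresh clone per fold so no fitted state carries over between folds.
    fitted = clone(model).fit(X[train_index], y[train_index])
    # The scorer predicts on X and compares against y in the right order.
    manual_scores.append(scorer(fitted, X[test_index], y[test_index]))

# Check whether the manual per-fold scores match those from cross_validate.
print(np.allclose(manual_scores, cv_scores))
```

Passing the labels by keyword, as in balanced_accuracy_score(y_true=..., y_pred=...), is a lighter alternative if you prefer to keep calling the metric function directly.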
