
K-Fold Cross-Validation on the Entire Dataset

I would like to know whether my current procedure is correct or whether it might cause data leakage. After importing the dataset, I split it with an 80/20 ratio.

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set, stratified on the class labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)

Then, after defining a CatBoostClassifier, I perform a grid search with cross-validation on the training set.

from catboost import CatBoostClassifier

clf = CatBoostClassifier(leaf_estimation_iterations=1, border_count=254,
                         scale_pos_weight=1.67)
grid = {'learning_rate': [0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 0.9],
        'depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9, 11, 13, 15],
        'iterations': [50, 150, 250, 350, 450, 600, 800, 1000]}
clf.grid_search(grid, X=X_train, y=y_train, cv=10)
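(As a side note, and going by CatBoost's documented behaviour for grid_search, which is worth verifying against your installed version: it returns a dict of search results and, with the default refit=True, retrains the model on the full training data with the best parameters. A small sketch of capturing that return value:)

result = clf.grid_search(grid, X=X_train, y=y_train, cv=10)
# 'params' holds the best parameter combination found;
# 'cv_results' holds the cross-validation metrics for it.
print(result['params'])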

Now I want to evaluate my model. Can I use the entire dataset to perform k-fold cross-validation in order to evaluate the model, as in the code below?

from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

kf = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scoring = ['accuracy', 'f1', 'roc_auc', 'recall', 'precision']
scores = cross_validate(clf, X, y, scoring=scoring, cv=kf,
                        return_train_score=True)
# Report mean +/- 2 standard deviations for each metric, on both the
# held-out folds ("TEST") and the training folds ("TRAIN").
for label, metric in [('Accuracy', 'accuracy'), ('F1', 'f1'),
                      ('AUROC', 'roc_auc'), ('Recall', 'recall'),
                      ('Precision', 'precision')]:
    print("%s TEST: %0.2f (+/- %0.2f) %s TRAIN: %0.2f (+/- %0.2f)" %
          (label, scores['test_' + metric].mean(),
           scores['test_' + metric].std() * 2,
           label, scores['train_' + metric].mean(),
           scores['train_' + metric].std() * 2))

Or should I perform this k-fold cross-validation on the training set only as well?

You usually do cross-validation as part of your training procedure: it is intended to find good hyperparameters for your model. Only at the end should you evaluate the model on the test set, that is, on data the model has never seen before, not even during cross-validation. This way you won't leak any data.

So yes, you should perform cross-validation on the training set only, and use the test set for the final evaluation only.
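To make that concrete, here is a minimal sketch of the leak-free version of the evaluation code above. It assumes clf already carries the best parameters from grid_search (CatBoost refits the model on the training data by default) and uses standard scikit-learn metrics: cross-validate on the training set if you want fold-level statistics, and touch the test set exactly once at the end.

from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# 1. Cross-validate on the training set only, e.g. to check how stable
#    the tuned model is across folds. No test-set rows are involved.
kf = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_validate(clf, X_train, y_train,
                        scoring=['accuracy', 'f1', 'roc_auc'], cv=kf)
print("CV AUROC: %0.2f (+/- %0.2f)" %
      (scores['test_roc_auc'].mean(), scores['test_roc_auc'].std() * 2))

# 2. Evaluate exactly once on the untouched test set.
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print("Test AUROC: %0.2f" % roc_auc_score(y_test, y_proba))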
