
K-Fold Cross-Validation on the Entire Dataset

I would like to know whether my current procedure is correct or whether it might cause data leakage. After importing the dataset, I split it with an 80/20 ratio.

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set, stratified on the class labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)

Then, after defining a CatBoostClassifier, I perform a grid search with cross-validation on the training set.

from catboost import CatBoostClassifier

clf = CatBoostClassifier(leaf_estimation_iterations=1, border_count=254,
                         scale_pos_weight=1.67)
grid = {'learning_rate': [0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 0.9],
        'depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9, 11, 13, 15],
        'iterations': [50, 150, 250, 350, 450, 600, 800, 1000]}
clf.grid_search(grid, X=X_train, y=y_train, cv=10)
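(As a side note, and going by CatBoost's documented behaviour for grid_search, which is worth verifying against your installed version: it returns a dict of search results and, with the default refit=True, retrains the model on the full training data with the best parameters. A small sketch of capturing that return value:)

result = clf.grid_search(grid, X=X_train, y=y_train, cv=10)
# 'params' holds the best parameter combination found;
# 'cv_results' holds the cross-validation metrics for it.
print(result['params'])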

Now I want to evaluate my model. Can I use the entire dataset to perform k-fold cross-validation in order to evaluate the model, as in the code below?

from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

kf = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scoring = ['accuracy', 'f1', 'roc_auc', 'recall', 'precision']
scores = cross_validate(clf, X, y, scoring=scoring, cv=kf,
                        return_train_score=True)
# Report mean +/- 2 standard deviations for each metric, on both the
# held-out folds ("TEST") and the training folds ("TRAIN").
for label, metric in [('Accuracy', 'accuracy'), ('F1', 'f1'),
                      ('AUROC', 'roc_auc'), ('Recall', 'recall'),
                      ('Precision', 'precision')]:
    print("%s TEST: %0.2f (+/- %0.2f) %s TRAIN: %0.2f (+/- %0.2f)" %
          (label, scores['test_' + metric].mean(),
           scores['test_' + metric].std() * 2,
           label, scores['train_' + metric].mean(),
           scores['train_' + metric].std() * 2))

Or should I perform this k-fold cross-validation on the training set only as well?

You usually do cross-validation as part of your training procedure: it is intended to find good hyperparameters for your model. Only at the end should you evaluate the model on the test set, that is, on data the model has never seen before, not even during cross-validation. This way you won't leak any data.

So yes, you should perform cross-validation on the training set only, and use the test set for the final evaluation only.
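To make that concrete, here is a minimal sketch of the leak-free version of the evaluation code above. It assumes clf already carries the best parameters from grid_search (CatBoost refits the model on the training data by default) and uses standard scikit-learn metrics: cross-validate on the training set if you want fold-level statistics, and touch the test set exactly once at the end.

from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# 1. Cross-validate on the training set only, e.g. to check how stable
#    the tuned model is across folds. No test-set rows are involved.
kf = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_validate(clf, X_train, y_train,
                        scoring=['accuracy', 'f1', 'roc_auc'], cv=kf)
print("CV AUROC: %0.2f (+/- %0.2f)" %
      (scores['test_roc_auc'].mean(), scores['test_roc_auc'].std() * 2))

# 2. Evaluate exactly once on the untouched test set.
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print("Test AUROC: %0.2f" % roc_auc_score(y_test, y_proba))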
