
Scikit-learn: use cross-validation on whole dataset after hyperparameters tuning

I'm using a Decision Tree in scikit-learn to classify spam emails. After reading various posts here and elsewhere, I split my initial dataset into training and testing sets and performed hyperparameter tuning on the training set using cross-validation. In my understanding, the scores should be calculated on both the training and testing sets to check whether the model is overfitting; given that the scores on the testing set are good, can I rule out overfitting and present the scores obtained from the whole dataset? Or should I present the results from my testing set instead? Here's the code for the train/test sets:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree, x_train, y_train, cv=10)
scores_pre = cross_val_score(tree, x_train, y_train, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, x_train, y_train, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, x_train, y_train, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.97 (+/- 0.02)
Precision: 0.98 (+/- 0.02)
F-Measure: 0.98 (+/- 0.01)
Recall: 0.98 (+/- 0.02)

scores = cross_val_score(tree, x_test, y_test, cv=10)
scores_pre = cross_val_score(tree, x_test, y_test, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, x_test, y_test, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, x_test, y_test, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.95 (+/- 0.03)
Precision: 0.96 (+/- 0.02)
F-Measure: 0.96 (+/- 0.02)
Recall: 0.97 (+/- 0.03)

This is the code for the whole dataset:

scores = cross_val_score(tree, X, y, cv=10)
scores_pre = cross_val_score(tree, X, y, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, X, y, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, X, y, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.97 (+/- 0.04)
Precision: 0.98 (+/- 0.03)
F-Measure: 0.98 (+/- 0.03)
Recall: 0.98 (+/- 0.03)

No, the scores you finally report should always be computed on the test set (which here is really acting as a validation set).
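A minimal sketch of that workflow: tune on the training split only (cross-validation happens inside the search), then report the scores from a single evaluation on the held-out test set. The parameter grid and the synthetic data below are illustrative assumptions, not the asker's actual setup.

```python
# Sketch of tune-on-train / report-on-test, using synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for the spam dataset (illustrative assumption)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Hyperparameter tuning on the training set only; cross-validation
# runs inside GridSearchCV, so the test set is never touched here.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10, None],
                "min_samples_leaf": [1, 5, 10]},
    cv=10)
grid.fit(x_train, y_train)

# Evaluate the tuned model exactly once on the held-out test set;
# these are the scores to report.
y_pred = grid.best_estimator_.predict(x_test)
print(classification_report(y_test, y_pred))
```

Note that running `cross_val_score(tree, x_test, y_test, ...)` as in the question does not evaluate the tuned model: `cross_val_score` clones and refits the estimator on folds of whatever data it is given, so those numbers come from fresh models trained on the test split itself.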
