
Scikit-learn: use cross-validation on whole dataset after hyperparameters tuning

I'm using a Decision Tree in scikit-learn to classify spam emails. After reading various posts here and elsewhere, I split my initial dataset into training and testing sets and performed hyperparameter tuning on the training set using cross-validation. In my understanding, the scores should be calculated on both the training and testing sets to check whether the model is overfitting; given that the scores on the testing set are good, can I rule out overfitting and present the scores obtained from the whole dataset? Or should I present the results from my testing set instead? Here's the code for the train/test sets:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree, x_train, y_train, cv=10)
scores_pre = cross_val_score(tree, x_train, y_train, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, x_train, y_train, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, x_train, y_train, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.97 (+/- 0.02)
Precision: 0.98 (+/- 0.02)
F-Measure: 0.98 (+/- 0.01)
Recall: 0.98 (+/- 0.02)

scores = cross_val_score(tree, x_test, y_test, cv=10)
scores_pre = cross_val_score(tree, x_test, y_test, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, x_test, y_test, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, x_test, y_test, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.95 (+/- 0.03)
Precision: 0.96 (+/- 0.02)
F-Measure: 0.96 (+/- 0.02)
Recall: 0.97 (+/- 0.03)

This is the code for the whole dataset:

scores = cross_val_score(tree, X, y, cv=10)
scores_pre = cross_val_score(tree, X, y, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, X, y, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, X, y, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.97 (+/- 0.04)
Precision: 0.98 (+/- 0.03)
F-Measure: 0.98 (+/- 0.03)
Recall: 0.98 (+/- 0.03)

No, the scores you finally report should always be computed on the test set (which here is really acting as a validation set).
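A minimal sketch of that workflow: tune on the training split only (cross-validation happens inside the search), then report the scores from a single evaluation on the held-out test set. The parameter grid and the synthetic data below are illustrative assumptions, not the asker's actual setup.

```python
# Sketch of tune-on-train / report-on-test, using synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for the spam dataset (illustrative assumption)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Hyperparameter tuning on the training set only; cross-validation
# runs inside GridSearchCV, so the test set is never touched here.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10, None],
                "min_samples_leaf": [1, 5, 10]},
    cv=10)
grid.fit(x_train, y_train)

# Evaluate the tuned model exactly once on the held-out test set;
# these are the scores to report.
y_pred = grid.best_estimator_.predict(x_test)
print(classification_report(y_test, y_pred))
```

Note that running `cross_val_score(tree, x_test, y_test, ...)` as in the question does not evaluate the tuned model: `cross_val_score` clones and refits the estimator on folds of whatever data it is given, so those numbers come from fresh models trained on the test split itself.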
