Why does my cross-validation consistently perform better than the train-test split?

Question

I have the code below (using sklearn) that first runs cross-validation on the training set and then, as a final check, evaluates on the test set. However, cross-validation consistently performs better, as shown below. Am I overfitting the training data? If so, which hyperparameter(s) would be best to modify to avoid this?

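from numpy import mean
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_validate, train_test_split

# Hold out 30% of the data as a final test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Cross-validation on the training set only
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
# A list rather than a set: scoring is documented as str, callable, list, tuple, or dict
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
scores = cross_validate(rfc, X_train, y_train, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc'])
      )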

which gives me:

0.8536558341101569 0.8641939667622551 0.8392201023654705 0.8514895113569482 0.9264002192260914

from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Re-train the model on the entire training set (training + validation folds)
# and evaluate it on the never-seen-before test set
RFC = RandomForestClassifier()

RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# Probability of the positive class, needed for ROC AUC
y_pred_proba = RFC.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)

print(accuracy,
      precision,
      recall,
      f1,
      auc
      )

This now gives me the numbers below, which are clearly worse:

0.7809788654060067 0.5113236034222446 0.5044687189672294 0.5078730317420644 0.7589037004728368

Answer 1

I am able to reproduce your scenario with the Pima Indians Diabetes Dataset.

The difference you see in the prediction metrics is not consistent, and in some runs you may even notice the opposite. It depends on which samples end up in X_test during the split: some cases are easier to predict and give better metrics, and vice versa. Cross-validation runs predictions on the whole set in rotation and averages out this effect, whereas a single X_test set suffers from the effects of the random split.
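The split-to-split variation described above can be made visible with a minimal sketch. It assumes a synthetic dataset from make_classification as a stand-in (not the data from the question), and the helper name holdout_accuracy is hypothetical. Only the split seed changes between runs, so any spread in the printed numbers comes from which samples land in the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the dataset from the question).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def holdout_accuracy(seed):
    """Accuracy on one random 70/30 split; `seed` controls only the split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    # The model seed is fixed so that differences come from the split alone.
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Repeat the hold-out evaluation over 20 different random splits.
split_accuracies = np.array([holdout_accuracy(seed) for seed in range(20)])

print("lowest :", split_accuracies.min())
print("highest:", split_accuracies.max())
print("mean   :", split_accuracies.mean())
print("std    :", split_accuracies.std())
```

A single train-test split corresponds to drawing one value from this range, which is why one run can look better or worse than the cross-validated average.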

To get better visibility into what is happening here, I modified your experiment and split it into two steps:

1. Cross-validation step:

I use the whole of the X and y sets and run the rest of the code as is:

rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
# cv = KFold(n_splits=10)
# A list rather than a set: scoring is documented as str, callable, list, tuple, or dict
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
# Cross-validate on the full data set this time
scores = cross_validate(rfc, X, y, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc'])
      )

Output:

0.768257006151743 0.6943032069967433 0.593436328663432 0.6357667086829574 0.8221242747913622

2. Classic train-test step:

Next I run the plain train-test step, but I do it 50 times with different train_test_split splits and average the metrics (similar to the cross-validation step).

accuracies = []
precisions = []
recalls = []
f1s = []
aucs = []

for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    RFC = RandomForestClassifier()

    RFC.fit(X_train, y_train)
    y_pred = RFC.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    y_pred_proba = RFC.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    accuracies.append(accuracy)
    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)
    aucs.append(auc)

print(mean(accuracies),
      mean(precisions),
      mean(recalls),
      mean(f1s),
      mean(aucs)
      )

Output:

0.7606926406926405 0.7001931059992001 0.5778712922956755 0.6306501622080503 0.8207846633339568

As expected, the prediction metrics are similar. However, cross-validation runs much faster and uses each data point of the whole data set for testing (in rotation) a given number of times.
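The "each data point is tested a given number of times" property can be checked directly. The sketch below is illustrative only: it assumes a hypothetical data set of 100 rows with placeholder features, and simply counts how often each row index appears in a test fold of RepeatedKFold.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

n_samples = 100  # hypothetical data set size, chosen for illustration
X = np.zeros((n_samples, 1))  # placeholder features; only the row count matters

cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)

# Count how many times each row index lands in a test fold.
test_counts = np.zeros(n_samples, dtype=int)
for train_idx, test_idx in cv.split(X):
    test_counts[test_idx] += 1

# Each repeat is a full K-fold partition, so every row is tested
# exactly once per repeat, i.e. n_repeats times in total.
print(np.unique(test_counts))  # → [5]
```

With repeated random train_test_split calls there is no such guarantee: a given row may be tested many times or not at all, depending on the draws.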
