
Why does my cross-validation consistently perform better than train-test split?

I have the code below (using sklearn) that first uses the training set for cross-validation and, as a final check, uses the test set. However, the cross-validation consistently performs better, as shown below. Am I over-fitting on the training data? And if so, which hyperparameter(s) would be best to modify to avoid this?

from numpy import mean

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RepeatedKFold, cross_validate
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Cross-validation on the training set
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
scores = cross_validate(rfc, X_train, y_train, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc'])
      )

which gives me:

0.8536558341101569 0.8641939667622551 0.8392201023654705 0.8514895113569482 0.9264002192260914

#re-train the model now with the entire training+validation set, and test it with never-seen-before test-set
RFC = RandomForestClassifier()

RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
y_pred_proba = RFC.predict_proba(X_test)[:, 1]  # probability of the positive class
auc = roc_auc_score(y_test, y_pred_proba)

print(accuracy,
      precision,
      recall,
      f1,
      auc
      )

This now gives me the numbers below, which are clearly worse:

0.7809788654060067 0.5113236034222446 0.5044687189672294 0.5078730317420644 0.7589037004728368

I am able to reproduce your scenario with the Pima Indians Diabetes Dataset.

The difference you see in the prediction metrics is not consistent, and in some runs you may even notice the opposite, because it depends on which samples end up in X_test during the split - some selections are easier to predict and give better metrics, and vice versa. While cross-validation runs predictions on the whole set in rotation and aggregates this effect away, a single X_test set suffers from the effects of the random split.
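A quick way to see this split effect, as a minimal sketch assuming the same X, y and imports as above, is to score the same model on test sets drawn with different random seeds:

for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = RandomForestClassifier(random_state=0)  # fix the model's randomness so only the split varies
    clf.fit(X_tr, y_tr)
    print(seed, accuracy_score(y_te, clf.predict(X_te)))  # accuracy moves with the split

The fuller experiment below averages this effect out over many splits.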

In order to have better visibility on what is happening here, I have modified your experiment and split it into two steps:

1. Cross-validation step:

I use the whole X and y sets and run the rest of the code as it is:

rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
# cv = KFold(n_splits=10)
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
scores = cross_validate(rfc, X, y, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc'])
      )

Output:

0.768257006151743 0.6943032069967433 0.593436328663432 0.6357667086829574 0.8221242747913622

2. Classic train-test step:

Next I run the plain train-test step, but I do it 50 times with different train_test splits and average the metrics (similar to the cross-validation step):

accuracies = []
precisions = []
recalls = []
f1s = []
aucs = []

for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    RFC = RandomForestClassifier()

    RFC.fit(X_train, y_train)
    y_pred = RFC.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    y_pred_proba = RFC.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    accuracies.append(accuracy)
    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)
    aucs.append(auc)

print(mean(accuracies),
      mean(precisions),
      mean(recalls),
      mean(f1s),
      mean(aucs)
      )

Output:

0.7606926406926405 0.7001931059992001 0.5778712922956755 0.6306501622080503 0.8207846633339568

As expected, the prediction metrics are similar. However, cross-validation runs much faster and uses each data point of the whole data set for testing (in rotation) a given number of times.
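If you want to check for over-fitting directly (the original question), one option is to ask cross_validate for the training scores as well and compare them with the validation scores; a large gap between the two suggests over-fitting. A minimal sketch, assuming the same X, y and imports as above:

rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
# return_train_score=True adds 'train_<metric>' keys next to the 'test_<metric>' keys
scores = cross_validate(rfc, X, y, scoring={'accuracy'}, cv=cv,
                        return_train_score=True)
print(mean(scores['train_accuracy']), mean(scores['test_accuracy']))

An unconstrained RandomForestClassifier typically fits the training folds almost perfectly, so some gap is expected; if it is too large, hyperparameters such as max_depth and min_samples_leaf are the usual candidates to constrain.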
