
Is 2-fold cross-validation essentially equal to a 50:50 train-test split?

I am working on a data project assignment where I am asked to use 50% of the data for training and the remaining 50% for testing. I would like to use the magic of cross-validation and still meet the aforementioned criteria.

Currently, my code is the following:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

clf = LogisticRegression(penalty='l2', class_weight='balanced')

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

# cross validation
cv = StratifiedKFold(n_splits=2)
for train, test in cv.split(X, y):
    probas_ = clf.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    # Interpolate the TPR onto a common FPR grid so folds can be averaged
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)

print("Average AUC: ", sum(aucs) / len(aucs), "AUC: ", aucs[-1])

Since I am using just 2 splits, is it considered as if I were using a 50:50 train-test split? Or should I first split the data 50:50, then use cross-validation on the training part, and finally use that model to test on the remaining 50%?

You should implement your second suggestion. Cross-validation should be used to tune the parameters of your approach. Among others, such parameters in your example are the value of the C parameter and the class_weight='balanced' setting of Logistic Regression. So you should:

  • split into 50% training, 50% test
  • use the training data to select the optimal values of your model's parameters with cross-validation
  • refit the model with the optimal parameters on the training data
  • predict for the test data and report the score of the evaluation measure you selected

Notice that you should use the test data only for reporting the final score and not for tuning the model, otherwise you are cheating. Imagine that in reality you may not have access to the test data until the last moment, so you cannot use it for tuning.
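As a minimal sketch of this workflow (assuming scikit-learn's train_test_split, GridSearchCV and roc_auc_score, the same X and y arrays as in the question, and a purely illustrative grid of C values):

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hold out 50% of the data as the final test set (stratified to keep class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Tune C with cross-validation on the training half only
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(
    LogisticRegression(penalty='l2', class_weight='balanced'),
    param_grid,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=5))
grid.fit(X_train, y_train)  # with refit=True (default), the best model is refit on all of X_train

# Final, report-only evaluation on the untouched 50% test half
test_auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print("Best C:", grid.best_params_['C'], "Test AUC:", test_auc)

Because GridSearchCV refits the best estimator on the whole training half, the test half is only ever used in the last two lines, which is exactly the "no peeking" rule described above.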
