
Is 2-fold cross-validation essentially equal to a 50:50 train-test split?

I am working on a data project assignment where I am asked to use 50% of the data for training and the remaining 50% for testing. I would like to use the magic of cross-validation and still meet the aforementioned criteria.

Currently, my code is the following:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

clf = LogisticRegression(penalty='l2', class_weight='balanced')

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

# cross validation
cv = StratifiedKFold(n_splits=2)
for train, test in cv.split(X, y):
    probas_ = clf.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    # Interpolate the TPR onto a common FPR grid so folds can be averaged
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)

print("Average AUC: ", sum(aucs) / len(aucs), "AUC: ", aucs[-1])

Since I am using just 2 splits, is it considered as if I were using a 50:50 train-test split? Or should I first split the data 50:50, then use cross-validation on the training part, and finally use that model to test on the remaining 50%?

You should implement your second suggestion. Cross-validation should be used to tune the parameters of your approach. Among others, such parameters in your example are the value of the C parameter and the class_weight='balanced' setting of Logistic Regression. So you should:

  • split into 50% training, 50% test
  • use the training data to select the optimal values of your model's parameters with cross-validation
  • refit the model with the optimal parameters on the training data
  • predict for the test data and report the score of the evaluation measure you selected

Notice that you should use the test data only for reporting the final score and not for tuning the model, otherwise you are cheating. Imagine that in reality you may not have access to the test data until the last moment, so you cannot use it for tuning.
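As a minimal sketch of this workflow (assuming scikit-learn's train_test_split, GridSearchCV and roc_auc_score, the same X and y arrays as in the question, and a purely illustrative grid of C values):

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hold out 50% of the data as the final test set (stratified to keep class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Tune C with cross-validation on the training half only
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(
    LogisticRegression(penalty='l2', class_weight='balanced'),
    param_grid,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=5))
grid.fit(X_train, y_train)  # with refit=True (default), the best model is refit on all of X_train

# Final, report-only evaluation on the untouched 50% test half
test_auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print("Best C:", grid.best_params_['C'], "Test AUC:", test_auc)

Because GridSearchCV refits the best estimator on the whole training half, the test half is only ever used in the last two lines, which is exactly the "no peeking" rule described above.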
