Is 2-fold cross-validation essentially equal to a train-test split of 50:50?
I am working on a data project assignment where I am asked to use 50% of the data for training and the remaining 50% for testing. I would like to use the magic of cross-validation and still meet the aforementioned criteria.
Currently, my code is the following:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

clf = LogisticRegression(penalty='l2', class_weight='balanced')
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

# cross-validation
cv = StratifiedKFold(n_splits=2)
for train, test in cv.split(X, y):
    probas_ = clf.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
print("Average AUC:", sum(aucs) / len(aucs), "AUC:", aucs[-1])
Since I am using just 2 splits, is it considered as if I were using a train-test split of 50:50? Or should I first split the data 50:50, use cross-validation on the training part, and finally use that model to test on the remaining 50% of the data?
You should implement your second suggestion. Cross-validation should be used to tune the parameters of your approach. Among others, such parameters in your example are the value of the C parameter and the class_weight='balanced' setting of Logistic Regression. So you should:

1. Split the data 50:50 into a training set and a test set.
2. Use cross-validation on the training set only to tune the hyperparameters.
3. Refit the model on the whole training set with the best parameters found, and report its score on the test set.
Notice that you should use the test data only for reporting the final score, not for tuning the model; otherwise you are cheating. Imagine that in reality you would not have access to the test data until the last moment, so you cannot use it during tuning.
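The recommended workflow can be sketched as follows. This is a minimal illustration assuming X and y are NumPy arrays as in your snippet (here generated synthetically so the example runs); the parameter grid values are placeholders I chose for illustration, not values from your assignment:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for your data
X, y = make_classification(n_samples=200, random_state=0)

# Step 1: 50:50 split, stratified so both halves keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Step 2: tune C and class_weight with cross-validation on the training half only
param_grid = {"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]}
search = GridSearchCV(
    LogisticRegression(penalty="l2", solver="liblinear"),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5),
)
search.fit(X_train, y_train)  # refits on all of X_train with the best parameters

# Step 3: report the score on the untouched test half, exactly once
probas = search.predict_proba(X_test)[:, 1]
print("Best params:", search.best_params_)
print("Test AUC:", roc_auc_score(y_test, probas))
```

Because GridSearchCV refits the best model on the full training half by default (refit=True), the fitted `search` object can be used directly for the final prediction on the test half.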