
Splitting a data set for K-fold Cross Validation in Sci-Kit Learn

I was assigned a task that requires creating a Decision Tree Classifier and determining the accuracy rates using the training set and 10-fold cross-validation. I went over the documentation for cross_val_predict as I believe that this is the module I am going to need.

What I am having trouble with is the splitting of the data set. As far as I am aware, in the usual case the train_test_split() method is used to split the data set into two parts: the train set and the test set. From my understanding, for K-fold validation you need to further split the train set into K parts.

My question is: do I need to split the data set into train and test at the beginning, or not?
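For reference, a minimal sketch of the setup the question describes, assuming a toy dataset and a DecisionTreeClassifier (the data and variable names below are illustrative, not from the question). Note that cross_val_predict returns per-sample predictions, while cross_val_score returns the per-fold accuracy the task asks for:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the assignment's dataset
X, y = load_iris(return_X_y=True)

# One option: hold out a test set first...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# ...then let cross_val_score create the 10 folds from the training set internally
clf = DecisionTreeClassifier(random_state=0)
fold_scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
print(fold_scores.mean())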

It depends. My personal opinion is yes, you have to split your dataset into a training set and a test set, then you can do cross-validation on your training set with K folds. Why? Because it is worthwhile to test your trained and fine-tuned model on unseen examples.

But some people just do a cross-validation. Here is the workflow I often use:

# Data partition (names such as `model`, `metric` and `my_param` below are placeholders)
from sklearn import model_selection
from sklearn.model_selection import cross_val_score

X_train, X_valid, Y_train, Y_valid = model_selection.train_test_split(X, Y, test_size=0.2, random_state=21)

# Cross-validation on multiple models to see which model gives the best results
print('Start cross val')
cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
# Then visualize the scores you just obtained using mean, std or a plot
print('Mean CV-score : ' + str(cv_score.mean()))

# Then I tune the hyperparameters of the best (or top-n best) model using another cross-val
for param in my_param:
    model = model_with_param  # the candidate model built with `param`
    cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
    print('Mean CV-score with param: ' + str(cv_score.mean()))

# Now that I have the best parameters for the model, I can train the final model
model = model_with_best_parameters
model.fit(X_train, Y_train)

# And finally test your tuned model on the held-out set
Y_pred = model.predict(X_valid)
plot_or_print_metric(Y_pred, Y_valid)
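As a side note, the manual tuning loop above can also be expressed with GridSearchCV, which runs the same cross-validation internally. A sketch, assuming a DecisionTreeClassifier and a max_depth grid as stand-ins for the placeholders above:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 5-fold CV grid search on the training set only; the held-out set stays untouched
search = GridSearchCV(DecisionTreeClassifier(random_state=21),
                      param_grid={'max_depth': [2, 4, 6, 8, None]},
                      scoring='accuracy', cv=5)
search.fit(X_train, Y_train)
print(search.best_params_, search.best_score_)

# With the default refit=True, best_estimator_ is already retrained on the full training set
Y_pred = search.best_estimator_.predict(X_valid)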

Short answer: NO

Long answer: if you want to use K-fold validation, you do not usually split the data initially into train/test.

There are a lot of ways to evaluate a model. The simplest one is to use a train/test split: fit the model on the train set and evaluate it on the test set.

If you adopt a cross-validation method, then you directly do the fitting/evaluation during each fold/iteration.
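A minimal sketch of that approach, i.e. cross-validating on the full dataset with no initial split (the toy data and classifier are assumptions, chosen to match the question's decision-tree setting):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# No initial train/test split: every sample is used for fitting in some folds
# and for evaluation in exactly one fold; the mean score is the estimate
X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())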


It's up to you what to choose, but I would go with K-Folds or LOOCV.

The K-Folds procedure is summarised in the figure (for K=5):

[Figure: K-fold cross-validation procedure with K=5]
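The procedure in the figure can also be written out explicitly with KFold, as in this sketch on a toy dataset (LeaveOneOut from sklearn.model_selection would be the same loop with a single sample in each test fold):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each iteration fits on 4/5 of the data and scores on the remaining 1/5
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    print(clf.score(X[test_idx], y[test_idx]))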
