
Splitting a data set for K-fold Cross Validation in Sci-Kit Learn

I was assigned a task that requires creating a Decision Tree Classifier and determining the accuracy rates using the training set and 10-fold cross-validation. I went over the documentation for cross_val_predict as I believe that this is the module I am going to need.

What I am having trouble with is the splitting of the data set. As far as I am aware, in the usual case the train_test_split() method is used to split the data set into two parts: the train set and the test set. From my understanding, for K-fold validation you need to further split the train set into K parts.

My question is: do I need to split the data set into train and test at the beginning, or not?
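For reference, a minimal sketch of the setup the question describes, assuming a toy dataset and a DecisionTreeClassifier (the data and variable names below are illustrative, not from the question). Note that cross_val_predict returns per-sample predictions, while cross_val_score returns the per-fold accuracy the task asks for:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the assignment's dataset
X, y = load_iris(return_X_y=True)

# One option: hold out a test set first...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# ...then let cross_val_score create the 10 folds from the training set internally
clf = DecisionTreeClassifier(random_state=0)
fold_scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
print(fold_scores.mean())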

It depends. My personal opinion is yes, you have to split your dataset into a training set and a test set, then you can do cross-validation on your training set with K folds. Why? Because it is worthwhile to test your trained and fine-tuned model on unseen examples.

But some people just do a cross-validation. Here is the workflow I often use:

# Data partition (names such as `model`, `metric` and `my_param` below are placeholders)
from sklearn import model_selection
from sklearn.model_selection import cross_val_score

X_train, X_valid, Y_train, Y_valid = model_selection.train_test_split(X, Y, test_size=0.2, random_state=21)

# Cross-validation on multiple models to see which model gives the best results
print('Start cross val')
cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
# Then visualize the scores you just obtained using mean, std or a plot
print('Mean CV-score : ' + str(cv_score.mean()))

# Then I tune the hyperparameters of the best (or top-n best) model using another cross-val
for param in my_param:
    model = model_with_param  # the candidate model built with `param`
    cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
    print('Mean CV-score with param: ' + str(cv_score.mean()))

# Now that I have the best parameters for the model, I can train the final model
model = model_with_best_parameters
model.fit(X_train, Y_train)

# And finally test your tuned model on the held-out set
Y_pred = model.predict(X_valid)
plot_or_print_metric(Y_pred, Y_valid)
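As a side note, the manual tuning loop above can also be expressed with GridSearchCV, which runs the same cross-validation internally. A sketch, assuming a DecisionTreeClassifier and a max_depth grid as stand-ins for the placeholders above:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 5-fold CV grid search on the training set only; the held-out set stays untouched
search = GridSearchCV(DecisionTreeClassifier(random_state=21),
                      param_grid={'max_depth': [2, 4, 6, 8, None]},
                      scoring='accuracy', cv=5)
search.fit(X_train, Y_train)
print(search.best_params_, search.best_score_)

# With the default refit=True, best_estimator_ is already retrained on the full training set
Y_pred = search.best_estimator_.predict(X_valid)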

Short answer: NO

Long answer: if you want to use K-fold validation, you do not usually split the data initially into train/test.

There are a lot of ways to evaluate a model. The simplest one is to use a train/test split: fit the model on the train set and evaluate it on the test set.

If you adopt a cross-validation method, then you directly do the fitting/evaluation during each fold/iteration.
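A minimal sketch of that approach, i.e. cross-validating on the full dataset with no initial split (the toy data and classifier are assumptions, chosen to match the question's decision-tree setting):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# No initial train/test split: every sample is used for fitting in some folds
# and for evaluation in exactly one fold; the mean score is the estimate
X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())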


It's up to you what to choose, but I would go with K-Folds or LOOCV.

The K-Folds procedure is summarised in the figure (for K=5):

[Figure: K-fold cross-validation procedure with K=5]
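The procedure in the figure can also be written out explicitly with KFold, as in this sketch on a toy dataset (LeaveOneOut from sklearn.model_selection would be the same loop with a single sample in each test fold):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each iteration fits on 4/5 of the data and scores on the remaining 1/5
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    print(clf.score(X[test_idx], y[test_idx]))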
