Training subset in Kfold sklearn

Is there a way to train a model on the training subset formed by 8 of the 10 folds produced by sklearn's kf = KFold(n_splits=10)?

I want to split my data into three subsets: training, validation, and testing (I think this can be done by calling train_test_split twice).

The training set is used to fit the model, the validation set is used to tune the parameters, and the test set is used to assess the generalization error of the final model.

But I was wondering whether there is a way to train on just 8 of the 10 folds and get an error/accuracy, validate on 1 fold, and finally test on the last fold, getting an error/accuracy for those as well.

See below for my thinking:

from sklearn import tree
from sklearn.model_selection import KFold, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=3)
kf = KFold(n_splits=10, shuffle=False)  # define number of splits; random_state is only valid with shuffle=True
kf.get_n_splits(X)  # check how many splits will be done
for train, test in kf.split(X_train, y_train):
    ...  # train on 8 folds, validate on 1, test on 1?

From your question, I understand that you want to leave out one or more of your subsets. In that case, you can hold data out using Leave One Out (LOO) or Leave P Out (LPO), which sklearn implements as LeaveOneOut and LeavePOut. Note that these leave out individual samples (one, or p at a time) rather than whole folds.
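One possible reading of this suggestion, given as a rough sketch rather than what the answer necessarily had in mind: cut the data into 10 folds with KFold, then apply LeavePOut(p=2) to the fold ids instead of the samples, so that each split keeps 8 folds for training and leaves 2 out (one for validation, one for testing). The iris dataset, the variable names, and the choice of the first combination are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeavePOut
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 150 samples, 3 classes.
X, y = load_iris(return_X_y=True)

# Cut the data into 10 folds and keep each split's held-out indices.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
folds = [held_out for _, held_out in kf.split(X)]

# LeavePOut over the 10 fold ids: every split keeps 8 folds and leaves 2 out.
lpo = LeavePOut(p=2)
fold_ids = np.arange(len(folds))
keep, left_out = next(lpo.split(fold_ids))  # first of the possible combinations

train_idx = np.concatenate([folds[i] for i in keep])
val_idx, test_idx = folds[left_out[0]], folds[left_out[1]]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X[train_idx], y[train_idx])

# Accuracy on each of the three subsets.
train_acc = clf.score(X[train_idx], y[train_idx])
val_acc = clf.score(X[val_idx], y[val_idx])
test_acc = clf.score(X[test_idx], y[test_idx])
print(f"train={train_acc:.3f} val={val_acc:.3f} test={test_acc:.3f}")
```

Looping over lpo.split(fold_ids) instead of taking only the first combination would repeat this for every pair of held-out folds, at the cost of fitting the model once per pair.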

You should change this line

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

to

X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=1)

to get exactly what you want. The first train_test_split divides the data 0.8/0.2 into train and test. The second divides that 0.2 into 0.1 for test and 0.1 for validation.

Then:
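To check the proportions, here is a small sketch of the two-stage split on made-up data. The 1000-sample random array is an assumption chosen only so that the sizes come out as round numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 1000 samples, 4 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# First split: 80% train, 20% held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
# Second split: halve the held-out 20% into 10% test and 10% validation.
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, test_size=0.5, random_state=1
)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```

Note that the training set stays at the full 80% here, whereas the original second call split the training set again and left only 64% for training.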

from sklearn.metrics import classification_report

model.fit(X_train, y_train)
print(classification_report(y_val, model.predict(X_val)))  # arguments are (y_true, y_pred)

Based on this report, you can decide whether to proceed to the test data or to change the hyperparameters to get higher scores on the validation set.
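That loop of "fit, inspect the validation score, adjust, and only then touch the test set" might look roughly like the following. This is a sketch under assumptions: the iris dataset stands in for the real data, max_depth is the only hyperparameter tried, and the candidate values 1 to 5 are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption), split 80/10/10 into train/validation/test.
X, y = load_iris(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
X_test, X_val, y_test, y_val = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=1, stratify=y_hold
)

# Tune max_depth using the validation set only.
best_depth, best_val_acc = None, -1.0
for depth in (1, 2, 3, 4, 5):
    model = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# Refit with the chosen depth and evaluate once on the untouched test set.
final_model = DecisionTreeClassifier(
    criterion="entropy", max_depth=best_depth, random_state=0
)
final_model.fit(X_train, y_train)
print(f"best max_depth={best_depth}, validation accuracy={best_val_acc:.3f}")
print(classification_report(y_test, final_model.predict(X_test)))
```

Keeping the test set out of the tuning loop is what lets its score serve as an estimate of generalization error.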

