
Nested cross-validation with GroupKFold with sklearn

In my data, several entries correspond to a single subject, and I don't want to mix those entries between the train and test sets. For this reason, I looked at the GroupKFold fold iterator, which according to the sklearn documentation is a "K-fold iterator variant with non-overlapping groups." I would therefore like to implement nested cross-validation using GroupKFold to split the test and train sets.

I started from the template given in this question. However, calling the fit method on the grid instance raised an error saying that groups does not have the same shape as X and y. To solve that, I sliced groups too, using the train index.

Is this implementation correct? I mostly care about not mixing data from the same groups between the train and test sets.

from sklearn.model_selection import GroupKFold, RandomizedSearchCV

inner_cv = GroupKFold(n_splits=inner_fold)
outer_cv = GroupKFold(n_splits=out_fold)

for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    grid = RandomizedSearchCV(estimator=model,
                              param_distributions=parameters_grid,
                              cv=inner_cv,
                              scoring=get_scoring(),
                              refit='roc_auc_scorer',
                              return_train_score=True,
                              verbose=1,
                              n_jobs=jobs)
    # groups is sliced with train_index so it matches x_train/y_train in length
    grid.fit(x_train, y_train, groups=groups[train_index])
    prediction = grid.predict(x_test)
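As a quick sanity check of the property the question cares about, here is a minimal sketch on synthetic data (the data, group sizes, and fold counts are all illustrative) confirming that GroupKFold never places the same group on both sides of an outer split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
x = rng.normal(size=(60, 4))
y = rng.integers(0, 2, size=60)
groups = np.repeat(np.arange(12), 5)  # 12 subjects, 5 entries each

outer_cv = GroupKFold(n_splits=3)
for train_index, test_index in outer_cv.split(x, y, groups=groups):
    train_groups = set(groups[train_index])
    test_groups = set(groups[test_index])
    # no subject appears in both the train and the test side of a fold
    assert train_groups.isdisjoint(test_groups)
```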

One way you can confirm that the code is doing what you intend (i.e. not mixing data between groups) is to pass RandomizedSearchCV not the GroupKFold object but the output (the indices) of GroupKFold.split, e.g.:

grid = RandomizedSearchCV(estimator=model,
                          param_distributions=parameters_grid,
                          cv=inner_cv.split(
                              x_train, y_train, groups=groups[train_index]),
                          scoring=get_scoring(),
                          refit='roc_auc_scorer',
                          return_train_score=True,
                          verbose=1,
                          n_jobs=jobs)
grid.fit(x_train, y_train)

I believe this leads to the same fitting result, and here you've explicitly given the training/validation indices for each fold of the cross-validation.
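The reason the two forms agree is that GroupKFold has no shuffling or random state, so repeated calls to split on the same inputs produce identical folds. A small sketch on synthetic data (all names illustrative) demonstrating this:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(1)
x = rng.normal(size=(40, 3))
y = rng.integers(0, 2, size=40)
groups = np.repeat(np.arange(8), 5)  # 8 subjects, 5 entries each

cv = GroupKFold(n_splits=4)
# materialising the indices twice yields the same folds, so passing
# either the splitter or its split() output to the search is equivalent
folds_a = list(cv.split(x, y, groups=groups))
folds_b = list(cv.split(x, y, groups=groups))
for (tr_a, te_a), (tr_b, te_b) in zip(folds_a, folds_b):
    assert np.array_equal(tr_a, tr_b)
    assert np.array_equal(te_a, te_b)
```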

As far as I can see, these two ways of doing it are equivalent, but I think the way your example is written is more elegant, since you aren't providing x_train and y_train twice.

And it appears correct to slice groups using train_index, since you're only passing the sliced x and y variables to the fit method. I have to remind myself that the inner cross-validation operates on the training subset of the outer cross-validation.
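That slicing can also be verified directly: when the inner splitter sees groups[train_index], the inner folds stay group-disjoint as well. A hedged sketch on synthetic data (variable names and fold counts are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(2)
x = rng.normal(size=(60, 4))
y = rng.integers(0, 2, size=60)
groups = np.repeat(np.arange(12), 5)  # 12 subjects, 5 entries each

outer_cv = GroupKFold(n_splits=3)
inner_cv = GroupKFold(n_splits=2)
for train_index, _ in outer_cv.split(x, y, groups=groups):
    x_train, y_train = x[train_index], y[train_index]
    g_train = groups[train_index]  # sliced so it matches x_train's length
    for in_tr, in_val in inner_cv.split(x_train, y_train, groups=g_train):
        # inner training and validation folds share no subject
        assert set(g_train[in_tr]).isdisjoint(set(g_train[in_val]))
```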
