In my data, several entries correspond to a single subject, and I don't want to mix those entries between the train and test sets. For this reason, I looked at the GroupKFold fold iterator, which according to the sklearn documentation is a "K-fold iterator variant with non-overlapping groups." I would therefore like to implement nested cross-validation using GroupKFold to split the train and test sets.
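As a quick illustration of that "non-overlapping groups" property, here is a minimal sketch with made-up toy data (the arrays below are assumptions for demonstration, not from my actual dataset):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 samples, 3 subjects, 2 samples per subject
x = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(x, y, groups=groups):
    # A subject never appears on both sides of the split
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```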
I started from the template given in this question. However, calling the fit method on the grid instance raised an error saying that groups does not have the same shape as X and y. To solve that, I sliced groups too, using the train index.
Is this implementation correct? I mostly care about not mixing data from the same group between the train and test sets.
inner_cv = GroupKFold(n_splits=inner_fold)
outer_cv = GroupKFold(n_splits=out_fold)

for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    grid = RandomizedSearchCV(estimator=model,
                              param_distributions=parameters_grid,
                              cv=inner_cv,
                              scoring=get_scoring(),
                              refit='roc_auc_scorer',
                              return_train_score=True,
                              verbose=1,
                              n_jobs=jobs)
    grid.fit(x_train, y_train, groups=groups[train_index])
    prediction = grid.predict(x_test)
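For reference, here is a self-contained version of the same loop on toy data. The model, parameter grid, and scoring are placeholders I made up so the sketch runs (the question's `model`, `parameters_grid`, and `get_scoring` are unspecified); I use LogisticRegression and plain accuracy rather than the roc_auc refit above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 3))
y = rng.integers(0, 2, size=40)
groups = np.repeat(np.arange(10), 4)  # 10 subjects, 4 entries each

inner_cv = GroupKFold(n_splits=3)
outer_cv = GroupKFold(n_splits=5)
# Illustrative parameter distribution (stand-in for parameters_grid)
param_distributions = {"C": [0.01, 0.1, 1.0, 10.0]}

for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    grid = RandomizedSearchCV(estimator=LogisticRegression(max_iter=1000),
                              param_distributions=param_distributions,
                              n_iter=4,
                              cv=inner_cv,
                              scoring="accuracy",
                              random_state=0)
    # Slicing groups with train_index keeps the inner folds group-aware
    grid.fit(x_train, y_train, groups=groups[train_index])
    prediction = grid.predict(x_test)
```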
One way you can confirm that the code is doing what you intend (i.e. not mixing data between groups) is to pass to RandomizedSearchCV not the GroupKFold object itself but the output (the indices) of GroupKFold.split, e.g.
grid = RandomizedSearchCV(estimator=model,
                          param_distributions=parameters_grid,
                          cv=inner_cv.split(x_train, y_train,
                                            groups=groups[train_index]),
                          scoring=get_scoring(),
                          refit='roc_auc_scorer',
                          return_train_score=True,
                          verbose=1,
                          n_jobs=jobs)
grid.fit(x_train, y_train)
I believe this leads to the same fitting result, and here you've explicitly given the training/validation indices for each fold of the cross-validation.
As far as I can see, the two approaches are equivalent, but the way your example is written is more elegant since you aren't providing x_train and y_train twice.
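You can convince yourself of the equivalence on toy data: GroupKFold (without shuffling) is deterministic, so the precomputed indices match what the cv object would generate internally. One caveat worth flagging in this hypothetical sketch: `split` returns a single-use generator, so materialize it with `list` if you need to inspect or reuse the folds.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy "outer training" data: 12 samples, 4 subjects, 3 samples each
x_train = np.arange(24).reshape(12, 2)
y_train = np.array([0, 1] * 6)
train_groups = np.repeat(np.arange(4), 3)

inner_cv = GroupKFold(n_splits=2)
# A generator passed as cv= is consumed after one use; a list is reusable
folds = list(inner_cv.split(x_train, y_train, groups=train_groups))
folds_again = list(inner_cv.split(x_train, y_train, groups=train_groups))

# The splitter is deterministic: both calls yield identical folds
assert all(np.array_equal(a, c) and np.array_equal(b, d)
           for (a, b), (c, d) in zip(folds, folds_again))
```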
It also appears correct to slice groups using train_index, since you're only passing the sliced x and y variables to the fit method. Keep in mind that the inner cross-validation operates on the training subset produced by the outer cross-validation.
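That reminder can be checked directly on toy data (arrays below are illustrative assumptions): the inner fold indices address only the outer training subset, the sliced groups stay aligned with it, and the inner folds are again disjoint by group.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 16 samples, 4 subjects, 4 samples each
x = np.arange(32).reshape(16, 2)
y = np.tile([0, 1], 8)
groups = np.repeat(np.arange(4), 4)

outer_cv = GroupKFold(n_splits=2)
inner_cv = GroupKFold(n_splits=2)
for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_tr, y_tr, g_tr = x[train_index], y[train_index], groups[train_index]
    assert len(g_tr) == len(x_tr)  # sliced groups align with sliced data
    for inner_tr, inner_val in inner_cv.split(x_tr, y_tr, groups=g_tr):
        # Inner indices address x_tr, i.e. only outer-training samples...
        assert inner_tr.max() < len(x_tr) and inner_val.max() < len(x_tr)
        # ...and inner folds are again disjoint by group
        assert set(g_tr[inner_tr]).isdisjoint(g_tr[inner_val])
```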