Using SMOTE and GridSearchCV in scikit-learn

Question

I'm dealing with an imbalanced dataset and want to run a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data I want to use SMOTE, and I know I can include it as a stage of a pipeline and pass that to GridSearchCV. My concern is that SMOTE will then be applied to both the training and validation folds, which is not what you are supposed to do: the validation set should not be oversampled. Am I right that the whole pipeline will be applied to both dataset splits? And if so, how can I work around this? Thanks a lot in advance.

Answer 1

Yes, it can be done, but with the imblearn Pipeline.

You see, imblearn has its own Pipeline to handle the samplers correctly. I described this in an answer to a similar question.

When predict() is called on an imblearn.pipeline.Pipeline object, it skips the sampling step and passes the data unchanged to the next transformer. You can confirm that by looking at the source code (later imblearn releases renamed fit_sample to fit_resample, so the attribute checked there may differ in your version):

        if hasattr(transform, "fit_sample"):
            pass
        else:
            Xt = transform.transform(Xt)
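
To make that skip logic concrete, here is a toy sketch in plain Python. It is not imblearn's implementation: ToyPipeline, DuplicateMinoritySampler and MajorityClassifier are hypothetical names made up for illustration. It only mimics the idea that a step exposing a resampling method runs during fit and is bypassed during predict.

```python
from collections import Counter


class DuplicateMinoritySampler:
    """Toy sampler (hypothetical): repeats minority rows until classes are balanced."""

    def fit_resample(self, X, y):
        counts = Counter(y)
        target = max(counts.values())
        X_out, y_out = list(X), list(y)
        for label, count in counts.items():
            rows = [x for x, lab in zip(X, y) if lab == label]
            for i in range(target - count):
                X_out.append(rows[i % len(rows)])
                y_out.append(label)
        return X_out, y_out


class MajorityClassifier:
    """Toy estimator (hypothetical): remembers how many rows it was trained on."""

    def fit(self, X, y):
        self.n_seen_ = len(X)
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyPipeline:
    """Toy pipeline (hypothetical): samplers act in fit() and are skipped in predict()."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                X, y = step.fit_resample(X, y)  # resample the training data only
            else:
                X = step.fit(X, y).transform(X)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                continue  # sampler is skipped: validation data stays untouched
            X = step.transform(X)
        return self.steps[-1].predict(X)


X_train, y_train = [[0], [1], [2], [3], [4]], [0, 0, 0, 0, 1]
X_val = [[5], [6]]

pipe = ToyPipeline([DuplicateMinoritySampler(), MajorityClassifier()])
pipe.fit(X_train, y_train)
preds = pipe.predict(X_val)

print("rows seen during fit:", pipe.steps[-1].n_seen_)
print("rows predicted:", len(preds))
```

The classifier is trained on the oversampled rows, while predict returns exactly one prediction per validation row, which is the behaviour you want inside each cross-validation fold.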

So for this to work correctly, you need the following:

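from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', LogisticRegression())
])

# params is your parameter grid, keyed by step name (e.g. 'classification__C')
grid = GridSearchCV(model, params, ...)
grid.fit(X, y)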

Fill in the details as necessary, and the pipeline will take care of the rest.

Answer 1: score 41, accepted, 2018-05-09 05:15:11

在 Scikit-learn 中使用 Smote 和 Gridsearchcv

问题描述

1 个解决方案

解决方案1 41 已采纳 2018-05-09 05:15:11

解决方案1
41 已采纳 2018-05-09 05:15:11