
Using SMOTE with GridSearchCV in scikit-learn

I'm dealing with an imbalanced dataset and want to run a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data I want to use SMOTE, and I know I can include it as a stage of a pipeline and pass that pipeline to GridSearchCV. My concern is that SMOTE will be applied to both the training and validation folds, which is not what you are supposed to do: the validation set should not be oversampled. Am I right that the whole pipeline will be applied to both dataset splits? And if so, how can I work around this? Thanks a lot in advance.

Yes, it can be done, but you need to use the imblearn Pipeline rather than scikit-learn's.

You see, imblearn has its own Pipeline to handle the samplers correctly. I described this in a similar question here.

When predict() is called on an imblearn.Pipeline object, it skips the sampling step and passes the data unchanged to the next transformer. You can confirm that by looking at the source code here:

        if hasattr(transform, "fit_sample"):
            pass
        else:
            Xt = transform.transform(Xt)

So for this to work correctly, you need the following:

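from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Use imblearn's Pipeline, not sklearn's, so the sampler runs during fit only
model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', LogisticRegression())
    ])

# params is your own parameter grid; X and y are your data
grid = GridSearchCV(model, params, ...)
grid.fit(X, y)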

Fill in the details as necessary, and the pipeline will take care of the rest.
