Using SMOTE with GridSearchCV in scikit-learn
I'm dealing with an imbalanced dataset and want to run a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data I want to use SMOTE, and I know I can include it as a stage of a pipeline and pass that pipeline to GridSearchCV. My concern is that SMOTE will then be applied to both the training and the validation folds, which is not what you are supposed to do: the validation set should not be oversampled. Am I right that the whole pipeline will be applied to both dataset splits? And if so, how can I work around this? Thanks a lot in advance.
Yes, it can be done, but you need to use the imblearn Pipeline rather than the scikit-learn one.
You see, imblearn has its own Pipeline that handles samplers correctly. I described this in a similar question here.
When predict() is called on an imblearn.Pipeline object, it skips the sampling step and passes the data unchanged to the next transformer. You can confirm this by looking at the source code here:
if hasattr(transform, "fit_sample"):
    pass
else:
    Xt = transform.transform(Xt)
So for this to work correctly, you need the following:
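from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # not sklearn.pipeline.Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', LogisticRegression())
])

# params: your grid, keyed as '<step>__<parameter>', e.g. 'classification__C'
grid = GridSearchCV(model, params, ...)
grid.fit(X, y)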
Fill in the details as necessary, and the pipeline will take care of the rest.