How to implement RandomUnderSampler in a scikit-learn pipeline?

Question

I have a scikit-learn pipeline that scales numeric features and encodes categorical features. It worked fine until I tried to add the RandomUnderSampler from imblearn. I want the undersampling step because my dataset is very imbalanced (1:1000).

I made sure to use the Pipeline class from imblearn instead of the one from sklearn. Below is the code I've tried.

Code that works (using the sklearn pipeline) without the undersampler:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.base import BaseEstimator, TransformerMixin
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])

transformer = Pipeline([
    # Union numeric, categoricals and boolean
    ('features', FeatureUnion(n_jobs=1, transformer_list=[
         # Select boolean features
        ('boolean', Pipeline([
            ('selector', TypeSelector('bool')),
        ])),
         # Select and scale numericals
        ('numericals', Pipeline([
            ('selector', TypeSelector(np.number)),
            ('scaler', StandardScaler()),
        ])),
         # Select and encode categoricals
        ('categoricals', Pipeline([
            ('selector', TypeSelector('category')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])) 
    ])),
])
pipe = Pipeline([('prep', transformer), 
                 ('clf', RandomForestClassifier(n_estimators=500, class_weight='balanced'))
                 ])

Code that does not work (using the imblearn pipeline) with the undersampler:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from imblearn.pipeline import Pipeline as Pipeline_imb
from imblearn.under_sampling import RandomUnderSampler

from sklearn.base import BaseEstimator, TransformerMixin
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])

transformer = Pipeline_imb([
    # Union numeric, categoricals and boolean
    ('features', FeatureUnion(n_jobs=1, transformer_list=[
         # Select boolean features
        ('boolean', Pipeline_imb([
            ('selector', TypeSelector('bool')),
        ])),
         # Select and scale numericals
        ('numericals', Pipeline_imb([
            ('selector', TypeSelector(np.number)),
            ('scaler', StandardScaler()),
        ])),
         # Select and encode categoricals
        ('categoricals', Pipeline_imb([
            ('selector', TypeSelector('category')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])) 
    ])),  
])
pipe = Pipeline_imb([
                 ('sampler', RandomUnderSampler(0.1)),
                 ('prep', transformer), 
                 ('clf', RandomForestClassifier(n_estimators=500, class_weight='balanced'))
                 ])

Here is the error I get:

/usr/local/lib/python3.6/dist-packages/sklearn/pipeline.py in __init__(self, steps, memory, verbose)
    133     def __init__(self, steps, memory=None, verbose=False):
    134         self.steps = steps
--> 135         self._validate_steps()
    136         self.memory = memory
    137         self.verbose = verbose

/usr/local/lib/python3.6/dist-packages/imblearn/pipeline.py in _validate_steps(self)
    144             if isinstance(t, pipeline.Pipeline):
    145                 raise TypeError(
--> 146                     "All intermediate steps of the chain should not be"
    147                     " Pipelines")
    148 

TypeError: All intermediate steps of the chain should not be Pipelines

Answer 1

If you look at imblearn's source in imblearn/pipeline.py, the _validate_steps function checks every intermediate step (every step except the last) to see whether it is an instance of scikit-learn's Pipeline ( isinstance(t, pipeline.Pipeline) ) and raises a TypeError if it is.

In your code, the intermediate steps are

RandomUnderSampler
transformer

Because the class Pipeline_imb inherits from scikit-learn's Pipeline, your transformer step (a Pipeline_imb) passes that isinstance check and triggers the error. Using Pipeline_imb for the inner pipelines is also redundant, since they contain no sampler.
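The check described above can be illustrated without either library installed. The sketch below uses hypothetical stand-in classes (SklearnPipeline, ImbPipeline, FeatureUnionLike) that only mimic the inheritance relationship and the validation loop; they are not the real scikit-learn or imblearn classes, and the real _validate_steps performs additional checks.

```python
# Stand-in classes only: they mimic the inheritance relationship and the
# isinstance check, and are NOT the real scikit-learn / imblearn classes.
class SklearnPipeline:
    """Hypothetical stand-in for sklearn.pipeline.Pipeline."""

    def __init__(self, steps):
        self.steps = steps


class ImbPipeline(SklearnPipeline):
    """Hypothetical stand-in for imblearn.pipeline.Pipeline."""

    def __init__(self, steps):
        super().__init__(steps)
        self._validate_steps()

    def _validate_steps(self):
        # Every step except the last one counts as an intermediate step.
        for _name, step in self.steps[:-1]:
            if isinstance(step, SklearnPipeline):
                raise TypeError(
                    "All intermediate steps of the chain should not be Pipelines"
                )


class FeatureUnionLike:
    """Hypothetical stand-in for sklearn.pipeline.FeatureUnion."""

    def __init__(self, transformer_list):
        self.transformer_list = transformer_list


def builds_ok(prep_step):
    """Return True if an outer pipeline accepts prep_step as a middle step."""
    try:
        ImbPipeline([("sampler", object()), ("prep", prep_step), ("clf", object())])
        return True
    except TypeError:
        return False


# A nested pipeline as the 'prep' step is rejected, because ImbPipeline
# is a subclass of SklearnPipeline.
nested = ImbPipeline([("features", FeatureUnionLike([]))])
wrapped_in_pipeline = builds_ok(nested)

# A union as the 'prep' step is accepted; pipelines nested inside the
# union are not inspected by this check.
inner = SklearnPipeline([("selector", object()), ("scaler", object())])
wrapped_in_union = builds_ok(FeatureUnionLike([("numericals", inner)]))
```

Only the top-level intermediate steps are inspected, which is why removing the outer pipeline wrapper around the FeatureUnion is enough to get past this particular error.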

That said, I would adjust your code as follows:

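# 'prep' is now a FeatureUnion rather than a Pipeline, so it passes the check.
# Each branch keeps its own selector so that scaling and encoding apply only
# to the matching columns; the inner pipelines use sklearn's plain Pipeline.
transformer = FeatureUnion(n_jobs=1, transformer_list=[
    # Select boolean features
    ('boolean', TypeSelector('bool')),
    # Select and scale numericals
    ('numericals', Pipeline([
        ('selector', TypeSelector(np.number)),
        ('scaler', StandardScaler()),
    ])),
    # Select and encode categoricals
    ('categoricals', Pipeline([
        ('selector', TypeSelector('category')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ])),
])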

pipe = Pipeline_imb([
    ('sampler', RandomUnderSampler(sampling_strategy=0.1)),
    ('prep', transformer), 
    ('clf', RandomForestClassifier(n_estimators=500, class_weight='balanced'))
])
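For reference, when the 0.1 given to RandomUnderSampler is a float, imbalanced-learn interprets it as the desired ratio of minority to majority samples after resampling (binary targets only). The following is a minimal standard-library sketch of that arithmetic, not imblearn's implementation; the function name random_undersample, the seed parameter, and the toy labels are assumptions made for illustration.

```python
import random
from collections import Counter


def random_undersample(labels, sampling_strategy, seed=0):
    """Return the sorted indices kept after undersampling the majority class.

    Illustrative sketch for binary labels: after resampling,
    n_minority / n_majority is approximately sampling_strategy.
    """
    counts = Counter(labels)
    (majority, _n_major), (minority, n_minor) = counts.most_common(2)

    # Number of majority samples to keep so the ratio matches the target.
    n_major_target = round(n_minor / sampling_strategy)

    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y == minority]

    # Keep every minority sample and a random subset of the majority class.
    rng = random.Random(seed)
    kept_majority = rng.sample(majority_idx, n_major_target)
    return sorted(minority_idx + kept_majority)


# Toy labels with the 1:1000 imbalance mentioned in the question.
labels = [1] * 10 + [0] * 10_000
kept = random_undersample(labels, sampling_strategy=0.1)
resampled_counts = Counter(labels[i] for i in kept)
```

With these toy labels, all 10 minority samples are kept and the majority class is cut from 10,000 to 100 samples, so the imbalance goes from 1:1000 to 1:10. Inside an imblearn pipeline the sampler runs only during fit, so prediction on new data is not resampled.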

Answer 1: 3, accepted, 2020-01-08 17:09:47
