
How to implement RandomUnderSampler in a scikit-learn pipeline?

I have a scikit-learn pipeline that scales numeric features and encodes categorical features. It was working fine until I tried to add the RandomUnderSampler from imblearn. My goal is to add an undersampling step because my dataset is very imbalanced (1:1000).

I made sure to use the Pipeline class from imblearn instead of the one from sklearn. Below is the code I've tried.

Code that works (using the sklearn pipeline) without the undersampler:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])

transformer = Pipeline([
    # Union numeric, categoricals and boolean
    ('features', FeatureUnion(n_jobs=1, transformer_list=[
         # Select boolean features
        ('boolean', Pipeline([
            ('selector', TypeSelector('bool')),
        ])),
         # Select and scale numericals
        ('numericals', Pipeline([
            ('selector', TypeSelector(np.number)),
            ('scaler', StandardScaler()),
        ])),
         # Select and encode categoricals
        ('categoricals', Pipeline([
            ('selector', TypeSelector('category')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])) 
    ])),
])
pipe = Pipeline([('prep', transformer), 
                 ('clf', RandomForestClassifier(n_estimators=500, class_weight='balanced'))
                 ])

Code that does not work (using the imblearn pipeline) with the undersampler:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from imblearn.pipeline import Pipeline as Pipeline_imb
from imblearn.under_sampling import RandomUnderSampler

class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])

transformer = Pipeline_imb([
    # Union numeric, categoricals and boolean
    ('features', FeatureUnion(n_jobs=1, transformer_list=[
         # Select boolean features
        ('boolean', Pipeline_imb([
            ('selector', TypeSelector('bool')),
        ])),
         # Select and scale numericals
        ('numericals', Pipeline_imb([
            ('selector', TypeSelector(np.number)),
            ('scaler', StandardScaler()),
        ])),
         # Select and encode categoricals
        ('categoricals', Pipeline_imb([
            ('selector', TypeSelector('category')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])) 
    ])),  
])
pipe = Pipeline_imb([
                 ('sampler', RandomUnderSampler(0.1)),
                 ('prep', transformer), 
                 ('clf', RandomForestClassifier(n_estimators=500, class_weight='balanced'))
                 ])

Here is the error I get:

/usr/local/lib/python3.6/dist-packages/sklearn/pipeline.py in __init__(self, steps, memory, verbose)
    133     def __init__(self, steps, memory=None, verbose=False):
    134         self.steps = steps
--> 135         self._validate_steps()
    136         self.memory = memory
    137         self.verbose = verbose

/usr/local/lib/python3.6/dist-packages/imblearn/pipeline.py in _validate_steps(self)
    144             if isinstance(t, pipeline.Pipeline):
    145                 raise TypeError(
--> 146                     "All intermediate steps of the chain should not be"
    147                     " Pipelines")
    148 

TypeError: All intermediate steps of the chain should not be Pipelines

If you look at imblearn's code in imblearn/pipeline.py, the function _validate_steps checks each item in transformers to see whether it is an instance of scikit-learn's Pipeline ( isinstance(t, pipeline.Pipeline) ).

In your code, the transformers are:

  1. RandomUnderSampler
  2. transformer

The class Pipeline_imb inherits from scikit-learn's Pipeline, so transformer, which you built with Pipeline_imb, is itself a Pipeline instance and triggers the error. Wrapping the preprocessing steps in Pipeline_imb is also redundant.
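To make the check concrete, here is a minimal sketch of the idea. It is a simplified re-implementation written for illustration, not imblearn's actual code: SubclassedPipeline is a hypothetical stand-in for imblearn's Pipeline, and intermediate_pipeline_steps is a made-up helper name.

```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class SubclassedPipeline(Pipeline):
    """Hypothetical stand-in for a Pipeline subclass such as imblearn's."""


def intermediate_pipeline_steps(steps):
    """Return the names of non-final steps that are sklearn Pipeline instances."""
    # isinstance also matches subclasses, which is why a Pipeline_imb is caught.
    return [name for name, est in steps[:-1] if isinstance(est, Pipeline)]


nested = SubclassedPipeline([("scaler", StandardScaler())])
union = FeatureUnion([("scaled", StandardScaler())])

# A nested pipeline used as an intermediate step is flagged ...
rejected = intermediate_pipeline_steps([("prep", nested), ("clf", "passthrough")])
# ... while a FeatureUnion in the same position is not.
accepted = intermediate_pipeline_steps([("prep", union), ("clf", "passthrough")])

print(rejected)  # ['prep']
print(accepted)  # []
```

A FeatureUnion is not a Pipeline, so a check like this lets it through as an intermediate step.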

That being said, I would adjust your code as follows, passing the FeatureUnion directly as the 'prep' step:

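# FeatureUnion is not a Pipeline, so it is accepted as an intermediate step.
# Each branch keeps its selector chained to its own transformer.
transformer = FeatureUnion(n_jobs=1, transformer_list=[
    # Select boolean features
    ('boolean', TypeSelector('bool')),
    # Select and scale numericals
    ('numericals', Pipeline([
        ('selector', TypeSelector(np.number)),
        ('scaler', StandardScaler()),
    ])),
    # Select and encode categoricals
    ('categoricals', Pipeline([
        ('selector', TypeSelector('category')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ])),
])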

pipe = Pipeline_imb([
    ('sampler', RandomUnderSampler(sampling_strategy=0.1)),
    ('prep', transformer), 
    ('clf', RandomForestClassifier(n_estimators=500, class_weight='balanced'))
])

