How to implement RandomUnderSampler in a scikit-learn pipeline?
I have a scikit-learn pipeline that scales numeric features and encodes categorical features. It worked fine until I tried to add the RandomUnderSampler from imblearn. I want the undersampling step because my dataset is heavily imbalanced (1:1000).
I made sure to use the Pipeline class from imblearn instead of the one from sklearn. Below is the code I've tried.
Code that works (using the sklearn Pipeline) without the undersampler:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class TypeSelector(BaseEstimator, TransformerMixin):
    """Select the DataFrame columns of a given dtype."""

    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])


transformer = Pipeline([
    # Union numeric, categoricals and boolean
    ('features', FeatureUnion(n_jobs=1, transformer_list=[
        # Select boolean features
        ('boolean', Pipeline([
            ('selector', TypeSelector('bool')),
        ])),
        # Select and scale numericals
        ('numericals', Pipeline([
            ('selector', TypeSelector(np.number)),
            ('scaler', StandardScaler()),
        ])),
        # Select and encode categoricals
        ('categoricals', Pipeline([
            ('selector', TypeSelector('category')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])),
    ])),
])

pipe = Pipeline([
    ('prep', transformer),
    ('clf', RandomForestClassifier(n_estimators=500, class_weight='balanced')),
])
Code that does not work (using the imblearn Pipeline) with the undersampler:
import numpy as np
import pandas as pd
from imblearn.pipeline import Pipeline as Pipeline_imb
from imblearn.under_sampling import RandomUnderSampler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class TypeSelector(BaseEstimator, TransformerMixin):
    """Select the DataFrame columns of a given dtype."""

    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])


transformer = Pipeline_imb([
    # Union numeric, categoricals and boolean
    ('features', FeatureUnion(n_jobs=1, transformer_list=[
        # Select boolean features
        ('boolean', Pipeline_imb([
            ('selector', TypeSelector('bool')),
        ])),
        # Select and scale numericals
        ('numericals', Pipeline_imb([
            ('selector', TypeSelector(np.number)),
            ('scaler', StandardScaler()),
        ])),
        # Select and encode categoricals
        ('categoricals', Pipeline_imb([
            ('selector', TypeSelector('category')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])),
    ])),
])

# 'prep' is itself a Pipeline_imb used as an intermediate step
pipe = Pipeline_imb([
    ('sampler', RandomUnderSampler(sampling_strategy=0.1)),
    ('prep', transformer),
    ('clf', RandomForestClassifier(n_estimators=500, class_weight='balanced')),
])
Here is the error I get:
/usr/local/lib/python3.6/dist-packages/sklearn/pipeline.py in __init__(self, steps, memory, verbose)
133 def __init__(self, steps, memory=None, verbose=False):
134 self.steps = steps
--> 135 self._validate_steps()
136 self.memory = memory
137 self.verbose = verbose
/usr/local/lib/python3.6/dist-packages/imblearn/pipeline.py in _validate_steps(self)
144 if isinstance(t, pipeline.Pipeline):
145 raise TypeError(
--> 146 "All intermediate steps of the chain should not be"
147 " Pipelines")
148
TypeError: All intermediate steps of the chain should not be Pipelines
If you look at imblearn's source in imblearn/pipeline.py, the function _validate_steps checks each item in transformers (the intermediate steps) and raises this error when one of them is an instance of scikit-learn's Pipeline (isinstance(t, pipeline.Pipeline)).
In your code, the intermediate steps are the RandomUnderSampler and transformer. The class Pipeline_imb inherits from scikit-learn's Pipeline, so transformer, which is a Pipeline_imb, is also an instance of Pipeline and fails that check. Using Pipeline_imb for the inner pipelines is redundant as well.
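To make the inheritance point concrete, here is a minimal sketch that uses stand-in classes rather than the real libraries. `SklearnPipeline`, `ImbPipeline`, and `try_build` are hypothetical names; they only mimic the subclass relationship and the isinstance check visible in the traceback, and the real `_validate_steps` does more than this.

```python
# Stand-in classes, NOT the real libraries: they only mimic the inheritance
# relationship and the validation check shown in the traceback.


class SklearnPipeline:
    """Hypothetical stand-in for sklearn.pipeline.Pipeline."""

    def __init__(self, steps):
        self.steps = steps
        self._validate_steps()

    def _validate_steps(self):
        pass  # the real class performs its own checks here


class ImbPipeline(SklearnPipeline):
    """Hypothetical stand-in for imblearn.pipeline.Pipeline."""

    def _validate_steps(self):
        # Every step except the last one is an intermediate step.
        for _name, step in self.steps[:-1]:
            # A subclass instance also counts as an instance of the parent.
            if isinstance(step, SklearnPipeline):
                raise TypeError(
                    "All intermediate steps of the chain should not be Pipelines"
                )


def try_build(steps):
    """Return 'ok' if the pipeline builds, otherwise the error message."""
    try:
        ImbPipeline(steps)
        return "ok"
    except TypeError as exc:
        return str(exc)


# A pipeline used as an intermediate step, like 'prep' in the question.
inner = ImbPipeline([("scaler", object())])
nested_result = try_build(
    [("sampler", object()), ("prep", inner), ("clf", object())]
)
# The same chain with a non-pipeline object in the 'prep' slot.
flat_result = try_build(
    [("sampler", object()), ("prep", object()), ("clf", object())]
)
print(nested_result)
print(flat_result)
```

In this sketch the nested chain is rejected only because the step in the middle is a pipeline object; what that step contains makes no difference to the check.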
That said, I would adjust your code as below, so that the prep step is a FeatureUnion rather than a Pipeline:
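# Assumes the imports and TypeSelector from the question, including
# sklearn's Pipeline. The top-level 'prep' step is a FeatureUnion, not a
# Pipeline, so it passes imblearn's check. Each branch stays grouped so the
# scaler and the encoder only see the columns selected for them.
transformer = FeatureUnion(n_jobs=1, transformer_list=[
    # Select boolean features
    ('boolean', TypeSelector('bool')),
    # Select and scale numericals
    ('numericals', Pipeline([
        ('selector', TypeSelector(np.number)),
        ('scaler', StandardScaler()),
    ])),
    # Select and encode categoricals
    ('categoricals', Pipeline([
        ('selector', TypeSelector('category')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ])),
])

pipe = Pipeline_imb([
    ('sampler', RandomUnderSampler(sampling_strategy=0.1)),
    ('prep', transformer),
    ('clf', RandomForestClassifier(n_estimators=500, class_weight='balanced')),
])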
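For reference, a float passed to RandomUnderSampler is read as the desired ratio of minority to majority samples after resampling, and it applies to binary targets. The sketch below illustrates that arithmetic with the standard library only. `undersample_majority` is a hypothetical helper written for this example, not imblearn's implementation, and the class sizes are made up to match the 1:1000 imbalance from the question.

```python
import random
from collections import Counter


def undersample_majority(labels, ratio, seed=0):
    """Return the sorted indices kept after dropping majority-class samples.

    `ratio` is the minority count divided by the majority count after
    resampling. Hypothetical helper for binary labels only.
    """
    counts = Counter(labels)
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    # Keep enough majority samples to reach the requested ratio,
    # but never more than are available.
    n_keep = min(n_major, round(n_minor / ratio))
    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y == minority]
    # All minority samples are kept; the majority class is subsampled.
    return sorted(rng.sample(major_idx, n_keep) + minor_idx)


# 1:1000 imbalance: 5 positives and 5000 negatives (made-up sizes).
labels = [1] * 5 + [0] * 5000
kept = undersample_majority(labels, ratio=0.1)
resampled_counts = Counter(labels[i] for i in kept)
print(resampled_counts)
```

With a ratio of 0.1, the majority class is cut down to ten times the minority class, so the data reaching the classifier is still imbalanced at 1:10 rather than fully balanced.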