How to use unified pipelines on numerical and categorical features in machine learning?

I want to run an encoder on the categorical features and an Imputer (see below) on the numerical features, then unify them all together.
For example, numerical together with categorical features:

import pandas as pd

df_with_cat = pd.DataFrame({
           'A'      : ['ios', 'android', 'web', 'NaN'],
           'B'      : [4, 4, 'NaN', 2], 
           'target' : [1, 1, 0, 0] 
       })
df_with_cat.head()

    A        B    target
------------------------
0   ios      4    1
1   android  4    1
2   web      NaN  0
3   NaN      2    0

We want to run an Imputer on the numerical features, i.e. replace missing values / NaN with the "most_frequent" / "median" / "mean" ==> Pipeline 1. But we want to transform the categorical features to numbers / OneHotEncoding etc. ==> Pipeline 2.

What is the best practice to unify them?
PS: And unify the above 2 with a classifier (random forest / decision tree / GBM)...

As mentioned by @Sergey Bushmanov, ColumnTransformer can be utilized to implement the same.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
           'A'      : ['ios', 'android', 'web', 'NaN'],
           'B'      : [4, 4, 'NaN', 2], 
           'target' : [1, 1, 0, 0] 
       })

categorical_features = ['A']
numeric_features = ['B']
TARGET = ['target']

df[numeric_features] = df[numeric_features].replace('NaN', np.nan)
columnTransformer = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features),
        ('num', SimpleImputer(strategy='most_frequent'), numeric_features)])

columnTransformer.fit_transform(df)

# Output:
array([[0., 0., 1., 0., 4.],
       [0., 1., 0., 0., 4.],
       [0., 0., 0., 1., 4.],
       [1., 0., 0., 0., 2.]])
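To also cover the PS part of the question, the ColumnTransformer can itself be chained with a classifier inside a single Pipeline, so imputation, encoding and the model are fit together. A minimal self-contained sketch (the step names `preprocess` / `model` are arbitrary, my own choice):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    'A':      ['ios', 'android', 'web', 'NaN'],
    'B':      [4, 4, 'NaN', 2],
    'target': [1, 1, 0, 0],
})
# Turn the 'NaN' strings into real missing values so the imputer sees them
df['B'] = df['B'].replace('NaN', np.nan)

preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), ['A']),
    ('num', SimpleImputer(strategy='most_frequent'), ['B']),
])

# One estimator: preprocessing first, then the classifier
clf = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', DecisionTreeClassifier(random_state=0)),
])
clf.fit(df[['A', 'B']], df['target'])
print(clf.predict(df[['A', 'B']]))
```

Calling `fit` on `clf` runs `fit_transform` on the preprocessing step and then fits the tree on the transformed matrix; `predict` applies the same (already fitted) transformations to new data.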

Apparently there is a cool way to do it! For this df:

df_with_cat = pd.DataFrame({
           'A'      : ['ios', 'android', 'web', 'NaN'],
           'B'      : [4, 4, 'NaN', 2], 
           'target' : [1, 1, 0, 0] 
       })

If you don't mind upgrading your sklearn to 0.20.2, run:

pip3 install scikit-learn==0.20.2

And use this solution (as suggested by @AI_learning):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

columnTransformer = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), CATEGORICAL_FEATURES),
        ('num', SimpleImputer(strategy='most_frequent'), NUMERICAL_FEATURES)
    ])

And then:

# convert the 'NaN' strings to real missing values first
df_with_cat['B'] = df_with_cat['B'].replace('NaN', np.nan)
columnTransformer.fit(df_with_cat)

But if you are working with an earlier sklearn version, use this one:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import LabelBinarizer, LabelEncoder 

CATEGORICAL_FEATURES = ['A']
NUMERICAL_FEATURES = ['B']
TARGET = ['target']

numerical_pipeline = Pipeline([
    ('selector', DataFrameSelector(NUMERICAL_FEATURES)),
    ('imputer', Imputer(strategy='most_frequent'))
])

categorical_pipeline = Pipeline([
    ('selector', DataFrameSelector(CATEGORICAL_FEATURES)),
    ('cat_encoder', LabelBinarizerPipelineFriendly())
])

If you paid attention, you noticed we are missing DataFrameSelector; it is not part of sklearn, so let's write it here:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    '''Select a subset of a DataFrame's columns and return them as a numpy array.'''
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

Let's unify them:

from sklearn.pipeline import FeatureUnion, make_pipeline

preprocessing_pipeline = FeatureUnion(transformer_list=[
    ('numerical_pipeline', numerical_pipeline),
    ('categorical_pipeline', categorical_pipeline)
])

That's it, now let's run:

preprocessing_pipeline.fit_transform(df_with_cat[CATEGORICAL_FEATURES+NUMERICAL_FEATURES])

Now let's go even crazier! Unify them with the classifier pipeline:

from sklearn import tree
clf = tree.DecisionTreeClassifier()
full_pipeline = make_pipeline(preprocessing_pipeline, clf)

And train them together:

full_pipeline.fit(df_with_cat[CATEGORICAL_FEATURES+NUMERICAL_FEATURES], df_with_cat[TARGET])

Just open a Jupyter notebook, take the pieces of code and try it out yourself!

Here is the definition of LabelBinarizerPipelineFriendly():

class LabelBinarizerPipelineFriendly(LabelBinarizer):
    '''
    Wrapper around LabelBinarizer to allow usage in sklearn.pipeline
    '''
    def fit(self, X, y=None):
        """Fit the binarizer on X, ignoring y (so Pipeline can call it)."""
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        return self

    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

The major advantage of this approach is that you can dump the trained model, together with the whole pipeline, to a pkl file, and then load the very same pipeline in real time (prediction in production).
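As a sketch of that dump/load round trip with joblib (shown here on a simplified numeric-only pipeline built with the modern `SimpleImputer` API; the file name `model.pkl` is illustrative):

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X = pd.DataFrame({'B': [4, 4, np.nan, 2]})
y = [1, 1, 0, 0]

full_pipeline = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    DecisionTreeClassifier(random_state=0),
)
full_pipeline.fit(X, y)

# Dump the fitted pipeline (imputer statistics and tree included) to disk...
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
joblib.dump(full_pipeline, path)

# ...and reload it later, e.g. inside a production prediction service.
loaded = joblib.load(path)
print(loaded.predict(X))
```

Because the preprocessing steps are serialized together with the classifier, the production service never needs to re-implement the imputation or encoding logic.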
