
How to use unified pipelines on numerical and categorical features in machine learning?

I want to run an encoder on the categorical features, an Imputer (see below) on the numerical features, and unify them all together.
For example, a DataFrame with both numerical and categorical features:

import pandas as pd

df_with_cat = pd.DataFrame({
    'A'      : ['ios', 'android', 'web', 'NaN'],
    'B'      : [4, 4, 'NaN', 2],
    'target' : [1, 1, 0, 0]
})
df_with_cat.head()

         A    B  target
0      ios    4       1
1  android    4       1
2      web  NaN       0
3      NaN    2       0

We would want to run an Imputer on the numerical features, i.e. replace missing values / NaN with the "most_frequent" / "median" / "mean" ==> Pipeline 1. And we want to transform the categorical features to numbers via OneHotEncoding etc. ==> Pipeline 2.

What is the best practice to unify them?
PS: also unify the above two with a classifier (random forest / decision tree / GBM).

As mentioned by @Sergey Bushmanov, ColumnTransformer can be used to achieve this.

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'A'      : ['ios', 'android', 'web', 'NaN'],
    'B'      : [4, 4, 'NaN', 2],
    'target' : [1, 1, 0, 0]
})

categorical_features = ['A']
numeric_features = ['B']
TARGET = ['target']

df[numeric_features] = df[numeric_features].replace('NaN', np.nan)

columnTransformer = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features),
        ('num', SimpleImputer(strategy='most_frequent'), numeric_features)])

columnTransformer.fit_transform(df)

# Output:
array([[0., 0., 1., 0., 4.],
       [0., 1., 0., 0., 4.],
       [0., 0., 0., 1., 4.],
       [1., 0., 0., 0., 2.]])
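
To also cover the PS in the question, the same ColumnTransformer can be chained with a classifier in a single Pipeline. A minimal sketch, assuming a random forest (the Pipeline wrapping is my addition, not part of the original answer):

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('preprocess', columnTransformer),   # the ColumnTransformer defined above
    ('clf', RandomForestClassifier(n_estimators=100, random_state=0))
])

X = df[categorical_features + numeric_features]
y = df['target']
model.fit(X, y)     # impute + one-hot encode, then fit the forest
model.predict(X)    # the same preprocessing is applied at predict time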

Apparently there is a cool way to do it! For this df:

df_with_cat = pd.DataFrame({
    'A'      : ['ios', 'android', 'web', 'NaN'],
    'B'      : [4, 4, 'NaN', 2],
    'target' : [1, 1, 0, 0]
})

If you don't mind upgrading your sklearn to 0.20.2, run:

pip3 install scikit-learn==0.20.2

And use this solution (as suggested by @AI_learning):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import Imputer  # still present in 0.20.x, though deprecated in favor of SimpleImputer

CATEGORICAL_FEATURES = ['A']
NUMERICAL_FEATURES = ['B']

columnTransformer = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), CATEGORICAL_FEATURES),
        ('num', Imputer(strategy='most_frequent'), NUMERICAL_FEATURES)
    ])

And then:

columnTransformer.fit(df_with_cat)

(As in the first snippet, it's safest to replace the string 'NaN' in column B with np.nan first, so the imputer sees real missing values.)

But if you are working with an earlier sklearn version, use this one:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import LabelBinarizer

CATEGORICAL_FEATURES = ['A']
NUMERICAL_FEATURES = ['B']
TARGET = ['target']

numerical_pipeline = Pipeline([
    ('selector', DataFrameSelector(NUMERICAL_FEATURES)),
    ('imputer', Imputer(strategy='most_frequent'))
])

categorical_pipeline = Pipeline([
    ('selector', DataFrameSelector(CATEGORICAL_FEATURES)),
    ('cat_encoder', LabelBinarizerPipelineFriendly())
])

If you paid attention, you'll notice we're missing DataFrameSelector; it is not part of sklearn, so let's write it here:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given columns from a DataFrame and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
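
For illustration (not part of the original recipe), the selector just slices the requested columns and hands back a plain NumPy array, which is what the downstream Imputer expects:

selector = DataFrameSelector(NUMERICAL_FEATURES)
selector.fit_transform(df_with_cat)   # 2D NumPy array holding column 'B' only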

Let's unify them:

from sklearn.pipeline import FeatureUnion, make_pipeline

preprocessing_pipeline = FeatureUnion(transformer_list=[
    ('numerical_pipeline', numerical_pipeline),
    ('categorical_pipeline', categorical_pipeline)
])

That's it, now let's run (again, it's safest to have replaced the string 'NaN' in B with np.nan first):

preprocessing_pipeline.fit_transform(df_with_cat[CATEGORICAL_FEATURES+NUMERICAL_FEATURES])

Now let's go even crazier! Unify them with the classifier pipeline:

from sklearn import tree
clf = tree.DecisionTreeClassifier()
full_pipeline = make_pipeline(preprocessing_pipeline, clf)

And train them together:

full_pipeline.fit(df_with_cat[CATEGORICAL_FEATURES+NUMERICAL_FEATURES], df_with_cat[TARGET])
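
Once fitted, the same pipeline preprocesses new data automatically at prediction time; an illustrative call (here on the training rows, for lack of new data):

full_pipeline.predict(df_with_cat[CATEGORICAL_FEATURES + NUMERICAL_FEATURES])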

Just open a Jupyter notebook, take the pieces of code and try it out yourself!

Here is the definition of LabelBinarizerPipelineFriendly():

class LabelBinarizerPipelineFriendly(LabelBinarizer):
    '''
    Wrapper around LabelBinarizer to allow its usage in sklearn.pipeline.
    '''
    def fit(self, X, y=None):
        """This allows us to fit the model based on the X input."""
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        return self

    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)

    def fit_transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)

The major advantage of this approach is that you can dump the trained model, together with the whole pipeline, to a pkl file, and then use the very same pipeline in real time (prediction in production).
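
For instance, a minimal sketch of that dump/load round trip with joblib (the file name is arbitrary; on 0.20-era sklearn the same object is also available as sklearn.externals.joblib):

import joblib

# Persist preprocessing + model as a single artifact.
joblib.dump(full_pipeline, 'full_pipeline.pkl')

# Later, e.g. in production: load and predict with identical preprocessing.
loaded_pipeline = joblib.load('full_pipeline.pkl')
loaded_pipeline.predict(df_with_cat[CATEGORICAL_FEATURES + NUMERICAL_FEATURES])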
