I want to run an encoder on the categorical features, an Imputer (see below) on the numerical features, and then unify them all together.
For example, a DataFrame with numerical and categorical features:
import pandas as pd

df_with_cat = pd.DataFrame({
    'A': ['ios', 'android', 'web', 'NaN'],
    'B': [4, 4, 'NaN', 2],
    'target': [1, 1, 0, 0]
})
df_with_cat.head()
         A    B  target
-----------------------
0      ios    4       1
1  android    4       1
2      web  NaN       0
3      NaN    2       0
We would want to run an Imputer on the numerical features, i.e. replace the missing values / NaN with the "most_frequent" / "median" / "mean" strategy ==> Pipeline 1. And we want to transform the categorical features to numbers (OneHotEncoding etc.) ==> Pipeline 2.
What is the best practice to unify them?
P.S.: and then unify the above two with a classifier (random forest / decision tree / GBM).
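For reference, a minimal sketch of what the three imputation strategies mentioned above do to a numeric column with a missing value (using sklearn's SimpleImputer):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Column B from the example: [4, 4, NaN, 2]
col = np.array([[4.0], [4.0], [np.nan], [2.0]])

for strategy in ('most_frequent', 'median', 'mean'):
    filled = SimpleImputer(strategy=strategy).fit_transform(col)
    # most_frequent -> 4.0, median of [4, 4, 2] -> 4.0, mean -> 10/3
    print(strategy, filled.ravel())
```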
As mentioned by @Sergey Bushmanov, ColumnTransformer can be used to implement this:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'A': ['ios', 'android', 'web', 'NaN'],
    'B': [4, 4, 'NaN', 2],
    'target': [1, 1, 0, 0]
})
categorical_features = ['A']
numeric_features = ['B']
TARGET = ['target']
df[numeric_features] = df[numeric_features].replace('NaN', np.nan)
columnTransformer = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features),
        ('num', SimpleImputer(strategy='most_frequent'), numeric_features)])
columnTransformer.fit_transform(df)
# Output:
array([[0., 0., 1., 0., 4.],
       [0., 1., 0., 0., 4.],
       [0., 0., 0., 1., 4.],
       [1., 0., 0., 0., 2.]])
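The question also asked to unify the preprocessing with a classifier; the ColumnTransformer above can simply be chained with one inside a Pipeline. A sketch (the RandomForestClassifier and its parameters here are an arbitrary choice):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'A': ['ios', 'android', 'web', 'NaN'],
    'B': [4, 4, np.nan, 2],
    'target': [1, 1, 0, 0]
})

preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), ['A']),
    ('num', SimpleImputer(strategy='most_frequent'), ['B'])])

# One estimator end to end: fit/predict run the preprocessing automatically.
full_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))])

full_pipeline.fit(df[['A', 'B']], df['target'])
predictions = full_pipeline.predict(df[['A', 'B']])
```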
Apparently there is a cool way to do it! For this df:
df_with_cat = pd.DataFrame({
    'A': ['ios', 'android', 'web', 'NaN'],
    'B': [4, 4, 'NaN', 2],
    'target': [1, 1, 0, 0]
})
If you don't mind upgrading your sklearn to 0.20.2, run:
pip3 install scikit-learn==0.20.2
And use this solution (as suggested by @AI_learning):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Imputer, OneHotEncoder

columnTransformer = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), CATEGORICAL_FEATURES),
        ('num', Imputer(strategy='most_frequent'), NUMERICAL_FEATURES)
    ])
And then:
columnTransformer.fit(df_with_cat)
But if you are working with an earlier sklearn version, use this one:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
CATEGORICAL_FEATURES = ['A']
NUMERICAL_FEATURES = ['B']
TARGET = ['target']
numerical_pipline = Pipeline([
    ('selector', DataFrameSelector(NUMERICAL_FEATURES)),
    ('imputer', Imputer(strategy='most_frequent'))
])
categorical_pipeline = Pipeline([
    ('selector', DataFrameSelector(CATEGORICAL_FEATURES)),
    ('cat_encoder', LabelBinarizerPipelineFriendly())
])
If you paid attention, you noticed we are missing the DataFrameSelector; it is not part of sklearn, so let's write it here:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the named columns from a DataFrame and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
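A quick sanity check of the selector on a small frame (the class is redefined here so the snippet is self-contained):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the named columns from a DataFrame and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

df = pd.DataFrame({'A': ['ios', 'android'], 'B': [4, 2]})
selected = DataFrameSelector(['B']).fit_transform(df)
print(selected)  # a (2, 1) array holding column B
```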
Let's unify them:
from sklearn.pipeline import FeatureUnion, make_pipeline
preprocessing_pipeline = FeatureUnion(transformer_list=[
    ('numerical_pipline', numerical_pipline),
    ('categorical_pipeline', categorical_pipeline)
])
That's it, now let's run:
preprocessing_pipeline.fit_transform(df_with_cat[CATEGORICAL_FEATURES+NUMERICAL_FEATURES])
Now let's go even crazier! Unify them with the classifier pipeline:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
full_pipeline = make_pipeline(preprocessing_pipeline, clf)
And train them together:
full_pipeline.fit(df_with_cat[CATEGORICAL_FEATURES+NUMERICAL_FEATURES], df_with_cat[TARGET])
Just open a Jupyter notebook, take the pieces of code and try it out yourself!
Here is the definition of LabelBinarizerPipelineFriendly():
class LabelBinarizerPipelineFriendly(LabelBinarizer):
    '''
    Wrapper around LabelBinarizer to allow usage in an sklearn Pipeline
    (Pipeline passes both X and y to fit, but LabelBinarizer.fit expects a single argument).
    '''
    def fit(self, X, y=None):
        """Fit the binarizer on X, ignoring y, and return self as sklearn expects."""
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        return self
    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)
    def fit_transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)
The major advantage of this approach is that you can dump the trained model, together with the whole pipeline, to a pkl file, and then use the very same pipeline in real time (prediction in production).
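For example, a sketch of persisting a fitted pipeline with joblib (which sklearn recommends for model persistence; the filename and the small pipeline here are arbitrary):

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({'B': [4, 4, np.nan, 2], 'target': [1, 1, 0, 0]})

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('clf', DecisionTreeClassifier(random_state=0))])
pipe.fit(df[['B']], df['target'])

joblib.dump(pipe, 'full_pipeline.pkl')       # one artifact: preprocessing + model
restored = joblib.load('full_pipeline.pkl')  # later, e.g. in production
predictions = restored.predict(df[['B']])
```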