Scikit-Learn pipeline code difference between ColumnTransformer and FeatureUnion

I'm using Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools... by Aurélien Géron.

I'm trying to run the code in chapter 1, after "Transformation Pipelines" and before "Select and Train a Model".

The old version of the book used the following code to do a combined transformation:

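from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
#from sklearn_features.transformers import DataFrameSelector

# Custom transformer: picks the named columns out of a DataFrame
# and passes them on as a NumPy array.
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# housing, housing_num and CombinedAttributesAdder are defined earlier in the chapter.
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

# Numeric branch: select columns, impute, add derived attributes, scale.
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
    ])
# Categorical branch: select the column, then binarize it.
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
    ])

# Run both branches on the full DataFrame and concatenate their outputs.
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared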

The new code, however, uses the newly introduced ColumnTransformer:

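from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

# Each entry is (name, transformer, columns); the column selection
# is handled by ColumnTransformer itself.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
    ])
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared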

I'd like to know why the old version of the code was discontinued and no longer works, and what ColumnTransformer offers compared to FeatureUnion.

At a quick glance, what I see is that the old code used a DataFrameSelector to choose which columns each pipeline should process. This was pretty cumbersome because you always had to define that DataFrameSelector by hand. This is the problem that ColumnTransformer solves: you pass the column names alongside each transformer and it does the selection for you.
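To make that concrete, here is a minimal sketch on a made-up DataFrame (the `toy` data and its column names are my own stand-ins for the book's housing set, not taken from the book). It shows ColumnTransformer routing columns to transformers without any hand-written selector class:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the book's housing DataFrame.
toy = pd.DataFrame({
    "rooms": [3.0, 5.0, np.nan, 4.0],
    "income": [2.5, 8.0, 4.1, 3.3],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

num_cols = ["rooms", "income"]
cat_cols = ["ocean_proximity"]

# Numeric branch: no selector step needed any more.
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# ColumnTransformer slices the DataFrame by column name for each branch
# and concatenates the results.
preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", OneHotEncoder(), cat_cols),
])

prepared = preprocess.fit_transform(toy)
print(prepared.shape)  # two scaled numeric columns plus three one-hot columns
```

The only place column names appear is the third element of each tuple, which is the job DataFrameSelector used to do.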

I don't think that the first way "stopped working" as such; it's just that now that the second option exists, you should prefer it. Your code snippets are a nice example of how this new feature helps you write clearer code.
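One caveat you may want to check against your installed version: as far as I know, LabelBinarizer.fit_transform accepts only the labels y, while Pipeline calls fit_transform(X, y) on its last step, so a pipeline ending in LabelBinarizer tends to fail with a TypeError on recent scikit-learn releases. A minimal sketch with made-up data (the `cats` array is hypothetical), next to the OneHotEncoder replacement used in the new code:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# Hypothetical single categorical column, shaped (n_samples, 1).
cats = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

# Old style: LabelBinarizer as the last pipeline step.
old_style = Pipeline([("label_binarizer", LabelBinarizer())])
try:
    old_style.fit_transform(cats)
    old_style_failed = False
except TypeError as exc:
    # Pipeline passes both X and y, but LabelBinarizer expects only y.
    old_style_failed = True
    print("LabelBinarizer inside a Pipeline raised:", exc)

# New style: OneHotEncoder is a regular feature transformer and accepts X.
new_style = Pipeline([("one_hot", OneHotEncoder())])
encoded = new_style.fit_transform(cats)
print(encoded.shape)
```

If that matches what you see, it would explain why the old snippet errors out even though FeatureUnion itself is still part of scikit-learn.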

Hope this clarifies your doubts!

ColumnTransformer is a better choice than FeatureUnion for the data preprocessing step: it handles column selection itself, so the code is simpler and you write less of it.
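As a rough illustration of how short this can get, the sketch below uses the make_column_transformer and make_column_selector helpers from sklearn.compose to pick columns by dtype instead of listing them. The `toy` DataFrame is invented for the example and is not the book's data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column.
toy = pd.DataFrame({
    "rooms": [3.0, 5.0, 6.0, 4.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

# Numeric columns are scaled; every non-numeric column is one-hot encoded.
# Transformer names are generated automatically from the class names.
preprocess = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(), make_column_selector(dtype_exclude=np.number)),
)

prepared = preprocess.fit_transform(toy)
print(prepared.shape)  # one scaled column plus three one-hot columns
```

Selecting by dtype is a design choice that suits simple tables; with mixed or ambiguous dtypes, explicit column lists as in the question are probably safer.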

A new alternative to this approach, which you may find simpler, is the new skdag package (disclaimer: I am the author). I wrote this because personally I found ColumnTransformers and FeatureUnions to be hard work, and Pipeline's support for Pandas dataframes wasn't enough for me.

skdag should support everything you're trying to do natively without any need for custom classes to handle dataframes. It lets you build up your workflow as a graph so there's no need for FeatureUnions any more. Here's your example rewritten with skdag:

from skdag import DAGBuilder

dag = (
    DAGBuilder(infer_dataframe=True)
    .add_step("input", "passthrough")
    .add_step("imputer", SimpleImputer(strategy="median"), deps={"input": num_attribs})
    .add_step("attribs_adder", CombinedAttributesAdder(), deps=["imputer"])
    .add_step("std_scaler", StandardScaler(), deps=["attribs_adder"])
    .add_step("label_binarizer", LabelBinarizer(), deps={"input": cat_attribs})
    .add_step("merged", "passthrough", deps=["std_scaler", "label_binarizer"])
    .make_dag()
)

dag.fit_transform(housing)

If you want to visualise the graph, you can call dag.show() in an interactive environment like Jupyter Notebooks, or dag.draw() to produce an image or text file:

dag.show()

Full documentation can be found at https://skdag.readthedocs.io/ .
