I'm using Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools... by Aurélien Géron.
I'm trying to run the code in Chapter 2, after "Transformation Pipelines" and before "Select and Train a Model".
The old version of the book used the following code for a combined transformation:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
from sklearn.pipeline import FeatureUnion
#from sklearn_features.transformers import DataFrameSelector

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
The new code, however, uses the newly introduced ColumnTransformer:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
I'd like to know why the old version of the code was discontinued and no longer works, and what ColumnTransformer offers compared to FeatureUnion.
At a quick glance, what I see is that the old code used a DataFrameSelector to select which columns to process further in the pipeline. This was pretty cumbersome, because you always had to define that DataFrameSelector by hand. This is the problem that ColumnTransformer solves.
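To make that concrete, here is a minimal, self-contained sketch of column selection with ColumnTransformer on a toy frame (the column names and values here are made up, not from the book's dataset):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy frame standing in for the housing data (columns are illustrative only)
df = pd.DataFrame({
    "median_income": [1.5, 3.0, 4.5, 6.0],
    "housing_median_age": [10, 20, 30, 40],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})

ct = ColumnTransformer([
    # Column names go straight into the transformer spec --
    # no hand-written selector class needed.
    ("num", StandardScaler(), ["median_income", "housing_median_age"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

out = ct.fit_transform(df)
print(out.shape)  # 2 scaled numeric columns + 3 one-hot columns -> (4, 5)
```

Each tuple is (name, transformer, columns), so the selection step that DataFrameSelector used to perform is built into the transformer specification itself.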
I don't think that the first way "stopped working"; it's just that now that the second option exists, you should prefer it. Your code snippets are a nice example of how this new feature helps to write clearer code.
Hope this clarifies your doubts!
ColumnTransformer is a better choice than FeatureUnion for the data preprocessing step: it is simpler and requires less code.
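One concrete way it saves code: ColumnTransformer can forward untouched columns with remainder="passthrough", where a FeatureUnion would need an extra selector-plus-pipeline branch. A minimal sketch on invented toy data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data; column names are illustrative only
df = pd.DataFrame({
    "rooms": [2.0, 4.0, 6.0],
    "age": [5, 15, 25],
    "id": [101, 102, 103],
})

# Scale two columns, pass the rest through untouched -- no FeatureUnion,
# no hand-written selector transformer.
ct = ColumnTransformer(
    [("num", StandardScaler(), ["rooms", "age"])],
    remainder="passthrough",
)

out = ct.fit_transform(df)
print(out.shape)  # (3, 3): scaled rooms/age plus the untouched id column
```

With FeatureUnion, the same result would require a third pipeline branch whose only job is selecting and forwarding the leftover column.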
A new alternative to this approach, which you may find simpler, is the new skdag package (disclaimer: I am the author). I wrote this because I personally found ColumnTransformers and FeatureUnions to be hard work, and Pipeline's support for Pandas dataframes wasn't enough for me. skdag should support everything you're trying to do natively, without any need for custom classes to handle dataframes. It lets you build up your workflow as a graph, so there's no need for FeatureUnions any more. Here's your example rewritten with skdag:
from skdag import DAGBuilder

dag = (
    DAGBuilder(infer_dataframe=True)
    .add_step("input", "passthrough")
    .add_step("imputer", SimpleImputer(strategy="median"), deps={"input": num_attribs})
    .add_step("attribs_adder", CombinedAttributesAdder(), deps=["imputer"])
    .add_step("std_scaler", StandardScaler(), deps=["attribs_adder"])
    .add_step("label_binarizer", LabelBinarizer(), deps={"input": cat_attribs})
    .add_step("merged", "passthrough", deps=["std_scaler", "label_binarizer"])
    .make_dag()
)

dag.fit_transform(housing)
If you want to visualise the graph, you can call dag.show() in an interactive environment like Jupyter Notebook, or dag.draw() to produce an image or text file:

dag.show()
Full documentation can be found at https://skdag.readthedocs.io/ .