Scikit-Learn pipeline code difference between ColumnTransformer and FeatureUnion

I'm using Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools... by Aurélien Géron.

I'm trying to run the code in chapter 1, after "Transformation Pipelines" and before "Select and Train a Model".

The old version of the book used the following code to do a combined transformation:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

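from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
#from sklearn_features.transformers import DataFrameSelector

# housing, housing_num and CombinedAttributesAdder are assumed to be defined earlier, as in the book
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]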


num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
    ])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
    ])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

The new code, however, uses the newly introduced ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
    ])
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

I'd like to know why the old version of the code was discontinued and no longer works, and what ColumnTransformer offers compared to FeatureUnion.

At a quick glance, what I see is that they used a DataFrameSelector to select which columns to process further in the pipeline. This was pretty cumbersome because you always had to define that DataFrameSelector by hand. This is the problem that ColumnTransformer solves.
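To make the column selection concrete, here is a minimal sketch on a made-up three-column DataFrame (the toy data and the name `preprocess` are my own, not from the book). It assumes a scikit-learn version that ships `make_column_selector` (0.22 or later, as far as I know), and it passes `sparse_threshold=0` only so that the result is always a dense array:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the book's `housing` DataFrame.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, np.nan, 1274.0],
    "median_income": [8.3, 8.3, 7.2, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
})

# No selector step: the pipeline only describes what to do with the columns.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# ColumnTransformer decides which columns each transformer receives,
# either by dtype (make_column_selector) or by an explicit list of names.
preprocess = ColumnTransformer(
    [
        ("num", num_pipeline, make_column_selector(dtype_include=np.number)),
        ("cat", OneHotEncoder(), ["ocean_proximity"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

prepared = preprocess.fit_transform(toy)
print(prepared.shape)
```

The two numeric columns are imputed and scaled, and the categorical column is expanded into one column per category, all without a hand-written selector class.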

I don't think that the first way "stopped working"; it's just that, now that the second option exists, you should try to use it instead. Your code snippets are a nice example of how this new feature helps to write clearer code.
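One caveat worth checking on your own install: as far as I can tell, on recent scikit-learn releases the LabelBinarizer step is where the older snippet tends to fail. Pipeline passes both X and y to the last step's fit_transform, while LabelBinarizer.fit_transform accepts only y. A small sketch with made-up data (the variable names are mine), which also shows OneHotEncoder handling the same column inside a Pipeline:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# Hypothetical single categorical column, similar to the book's ocean_proximity.
cats = pd.DataFrame(
    {"ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"]}
)

# LabelBinarizer is designed for target labels: fit_transform(y) takes one argument.
label_pipeline = Pipeline([("label_binarizer", LabelBinarizer())])
try:
    label_pipeline.fit_transform(cats)
    error_message = None
except TypeError as exc:
    # Pipeline called fit_transform(X, y), which LabelBinarizer does not accept.
    error_message = str(exc)
print("LabelBinarizer in a Pipeline:", error_message)

# OneHotEncoder is a feature transformer with the usual fit_transform(X, y=None).
onehot_pipeline = Pipeline([("one_hot", OneHotEncoder())])
encoded = onehot_pipeline.fit_transform(cats).toarray()
print(encoded.shape)
```

If the first call raises a TypeError for you as well, swapping LabelBinarizer for OneHotEncoder (as the newer snippet does) should be enough to get the categorical branch running.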

Hope this clarifies your doubts!

ColumnTransformer is a better choice than FeatureUnion for the data preprocessing step, as it is simpler and requires less code.
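As a rough illustration of "less code", the sketch below builds the same preprocessing both ways on a made-up DataFrame and checks that the outputs match. The toy data, the `to_dense` helper and the variable names are my own assumptions, and OneHotEncoder is used in both versions so that the comparison is like for like:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Hand-written column selector, as needed by the FeatureUnion approach."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# Hypothetical toy data.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "median_income": [8.3, 8.3, 7.2, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
})
num_cols = ["total_rooms", "median_income"]
cat_cols = ["ocean_proximity"]

# FeatureUnion: every branch needs its own selector step.
union = FeatureUnion([
    ("num", Pipeline([("select", DataFrameSelector(num_cols)),
                      ("scale", StandardScaler())])),
    ("cat", Pipeline([("select", DataFrameSelector(cat_cols)),
                      ("one_hot", OneHotEncoder())])),
])

# ColumnTransformer: the column lists are part of the transformer spec.
columns = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(), cat_cols),
])

def to_dense(matrix):
    """Convert a possibly sparse result to a dense NumPy array."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

union_out = to_dense(union.fit_transform(toy))
ct_out = to_dense(columns.fit_transform(toy))
print(np.allclose(union_out, ct_out))
```

Both versions should produce the same matrix here; the ColumnTransformer one simply does it without the custom class and the extra Pipeline wrappers.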

A new alternative to this approach, which you may find simpler, is the new skdag package (disclaimer: I am the author). I wrote this because personally I found ColumnTransformers and FeatureUnions to be hard work, and Pipeline's support for Pandas dataframes wasn't enough for me.

skdag should support everything you're trying to do natively, without any need for custom classes to handle dataframes. It lets you build up your workflow as a graph, so there's no need for FeatureUnions any more. Here's your example rewritten with skdag:

from skdag import DAGBuilder

dag = (
    DAGBuilder(infer_dataframe=True)
    .add_step("input", "passthrough")
    .add_step("imputer", SimpleImputer(strategy="median"), deps={"input": num_attribs})
    .add_step("attribs_adder", CombinedAttributesAdder(), deps=["imputer"])
    .add_step("std_scaler", StandardScaler(), deps=["attribs_adder"])
    .add_step("label_binarizer", LabelBinarizer(), deps={"input": cat_attribs})
    .add_step("merged", "passthrough", deps=["std_scaler", "label_binarizer"])
    .make_dag()
)

dag.fit_transform(housing)

If you want to visualise the graph, you can call dag.show() in an interactive environment like Jupyter Notebooks, or dag.draw() to produce an image or text file:

dag.show()

Full documentation can be found at https://skdag.readthedocs.io/ .
