
How to be sure that an sklearn pipeline applies the fit_transform method when using feature selection and an ML model in the pipeline?

Assume that I want to apply several feature selection methods using an sklearn pipeline. An example is provided below:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

fs_pipeline = Pipeline([('vt', VarianceThreshold(0.01)),
                        ('kbest', SelectKBest(chi2, k=5)),
                        ])

X_new = fs_pipeline.fit_transform(X_train, y_train)

I get the selected features using the fit_transform method. If I use the fit method on the pipeline instead, I get the pipeline object back.

Now, assume that I want to add an ML model to the pipeline, as below:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])


model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

If I use the fit_transform method in the above code ( model.fit_transform(X_train, y_train) ), I get the error:

AttributeError: 'GradientBoostingClassifier' object has no attribute 'transform'

So I should use model.fit(X_train, y_train). But how can I be sure that the pipeline applied the fit_transform method for the feature selection steps?

A pipeline is meant for sequential data transformation (for which it needs multiple calls to .fit_transform() ). You can be sure that .fit_transform() is called on the intermediate steps of a pipeline (that is, on all steps but the last one), because that is how it works by design.

Namely, when calling .fit() or .fit_transform() on a Pipeline instance, .fit_transform() is called sequentially on every step except the last one, and the output of each call is passed as input to the next call. On the very last step, either .fit() or .fit_transform() is called, depending on the method called on the pipeline itself; indeed, the last step is more commonly an estimator than a transformer (as is the case with your GradientBoostingClassifier ).
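One way to observe this yourself is to log the calls. The sketch below wraps VarianceThreshold in a small subclass, LoggingVarianceThreshold, which is a hypothetical helper written for this illustration and not part of sklearn, and records each time the pipeline invokes its fit_transform. It assumes the default pipeline configuration (no memory caching), in which the step objects are used directly rather than cloned.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

calls = []  # illustrative call log, filled by the subclass below


class LoggingVarianceThreshold(VarianceThreshold):
    """Hypothetical helper (not part of sklearn): logs fit_transform calls."""

    def fit_transform(self, X, y=None, **fit_params):
        calls.append("vt.fit_transform")
        return super().fit_transform(X, y, **fit_params)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = Pipeline([('vt', LoggingVarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])

# Only .fit() is called on the pipeline itself.
model.fit(X_train, y_train)
print(calls)

# The intermediate steps are fitted afterwards: their support masks exist.
n_kept_by_vt = int(model.named_steps['vt'].get_support().sum())
n_kept_by_kbest = int(model.named_steps['kbest'].get_support().sum())
print(n_kept_by_vt, n_kept_by_kbest)
```

If the log contains an entry for the first step after a plain model.fit(...), the pipeline called fit_transform on it. The fitted support masks on 'vt' and 'kbest' are a second, less intrusive check that needs no subclass.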

Whenever the last step is an estimator rather than a transformer, as in your case, you won't be able to call .fit_transform() on the pipeline instance, because the pipeline exposes the same methods as its final step, and estimators of this kind expose neither .transform() nor .fit_transform() .
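The difference between the two kinds of pipeline can be checked with a short sketch. It reuses the data and steps from the question; the variable name raised is introduced here only for illustration. The exact error message may differ between sklearn versions, so the sketch only relies on the exception type.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Last step is a transformer: fit_transform is available on the pipeline.
fs_pipeline = Pipeline([('vt', VarianceThreshold(0.01)),
                        ('kbest', SelectKBest(chi2, k=5))])
X_new = fs_pipeline.fit_transform(X_train, y_train)
print(X_new.shape)

# Last step is a classifier: fit_transform fails with AttributeError.
model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])
try:
    model.fit_transform(X_train, y_train)
    raised = False
except AttributeError as exc:
    raised = True
    print(exc)
```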

Summing up,

  • case with an estimator in the last step (you can only call .fit() on the pipeline); model.fit(X_train, y_train) means the following:

     final_estimator.fit(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)

    which in your case becomes

     gbc.fit(kbest.fit_transform(vt.fit_transform(X_train, y_train), y_train), y_train)
  • case with a transformer in the last step (you can call either .fit() or .fit_transform() on the pipeline, but let's suppose you're calling .fit_transform() ); model.fit_transform(X_train, y_train) means the following:

     final_transformer.fit_transform(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)
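The first case above can be verified by chaining the steps by hand and comparing against the pipeline. This is a minimal sketch using the data and steps from the question; the intermediate variable names (X_vt, X_kb, manual_pred, pipe_pred) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Manual chain: fit_transform on each selector, then fit on the classifier.
vt = VarianceThreshold(0.01)
kbest = SelectKBest(chi2, k=5)
gbc = GradientBoostingClassifier(random_state=0)

X_vt = vt.fit_transform(X_train, y_train)
X_kb = kbest.fit_transform(X_vt, y_train)
gbc.fit(X_kb, y_train)
# At prediction time only transform (not fit_transform) is applied.
manual_pred = gbc.predict(kbest.transform(vt.transform(X_test)))

# Same steps wrapped in a pipeline, calling only .fit().
model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)
pipe_pred = model.predict(X_test)

# model[:-1] is the sub-pipeline holding the two fitted selectors.
same_features = np.allclose(model[:-1].transform(X_train), X_kb)
same_predictions = np.array_equal(manual_pred, pipe_pred)
print(same_features, same_predictions)
```

Since the classifier uses a fixed random_state, matching features and matching predictions indicate that the pipeline performed the same fit_transform chain internally.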

Finally, here's the reference in the source code: https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/pipeline.py#L351
