How can I be sure that an sklearn pipeline applies the fit_transform method when it combines feature selection with an ML model?

Question

Assume that I want to apply several feature selection methods using an sklearn pipeline. An example is provided below:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

fs_pipeline = Pipeline([('vt', VarianceThreshold(0.01)),
                        ('kbest', SelectKBest(chi2, k=5)),
                        ])

X_new = fs_pipeline.fit_transform(X_train, y_train)

I get the selected features using the fit_transform method. If I use the fit method on the pipeline, I get the pipeline object back.

Now, assume that I want to add an ML model to the pipeline, as below:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])


model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

If I use the fit_transform method in the above code (model.fit_transform(X_train, y_train)), I get the error:

AttributeError: 'GradientBoostingClassifier' object has no attribute 'transform'

So I should use model.fit(X_train, y_train). But how can I be sure that the pipeline applied the fit_transform method for the feature selection steps?

Answer 1

A pipeline is meant for sequential data transformation, which is why it makes multiple calls to .fit_transform(). You can be sure that .fit_transform() is called on the intermediate steps (that is, on every step but the last one) of a pipeline, because that is how it works by design.

Namely, when calling .fit() or .fit_transform() on a Pipeline instance, .fit_transform() is called sequentially on every step except the last one, and the output of each call is passed as input to the next call. On the very last step, either .fit() or .fit_transform() is called, depending on the method called on the pipeline itself. In practice, the last step is more often an estimator than a transformer (as is the case with your GradientBoostingClassifier).
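One way to observe this call order directly is to subclass the steps so that each one records the method the pipeline invokes on it. The sketch below is illustrative only: `CallRecorder`, `RecordingVT`, `RecordingKBest`, `RecordingGBC` and the `calls` list are hypothetical names introduced here, not part of scikit-learn, and it assumes the pipeline is built without `memory`, so the step objects are used as passed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

calls = []  # hypothetical log of (class name, method name) pairs


class CallRecorder:
    """Hypothetical mixin: log each fit_transform call, then delegate."""

    def fit_transform(self, X, y=None, **fit_params):
        calls.append((type(self).__name__, "fit_transform"))
        return super().fit_transform(X, y, **fit_params)


class RecordingVT(CallRecorder, VarianceThreshold):
    pass


class RecordingKBest(CallRecorder, SelectKBest):
    pass


class RecordingGBC(GradientBoostingClassifier):
    """Hypothetical subclass: log each fit call, then delegate."""

    def fit(self, X, y, **fit_params):
        calls.append((type(self).__name__, "fit"))
        return super().fit(X, y, **fit_params)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = Pipeline([
    ("vt", RecordingVT(0.01)),
    ("kbest", RecordingKBest(chi2, k=5)),
    ("gbc", RecordingGBC(random_state=0)),
])

# Only .fit() is called on the pipeline itself.
model.fit(X_train, y_train)

for class_name, method_name in calls:
    print(f"{class_name}.{method_name}")
```

The printed log should list fit_transform for the two selectors, in pipeline order, followed by fit for the classifier.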

Whenever the last step is an estimator rather than a transformer, as in your case, you won't be able to call .fit_transform() on the pipeline instance: the pipeline itself exposes the same methods as its final estimator/transformer, and an estimator such as GradientBoostingClassifier exposes neither .transform() nor .fit_transform().
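A small sketch of what this means in practice, assuming a scikit-learn version that supports pipeline slicing (`model[:-1]`): calling fit_transform on the full pipeline fails, but after a plain fit the already-fitted selection steps can still be applied on their own. The names `raised` and `X_selected` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = Pipeline([
    ("vt", VarianceThreshold(0.01)),
    ("kbest", SelectKBest(chi2, k=5)),
    ("gbc", GradientBoostingClassifier(random_state=0)),
])

# The final step is a classifier, so fit_transform is not usable.
try:
    model.fit_transform(X_train, y_train)
    raised = False
except AttributeError:
    raised = True

# A plain fit works and fits every step, including the selectors.
model.fit(X_train, y_train)

# model[:-1] is a sub-pipeline holding the same fitted selectors,
# so it can transform data without refitting anything.
X_selected = model[:-1].transform(X_train)
print(raised, X_selected.shape)
```

Slicing reuses the fitted step objects, so `X_selected` is the feature matrix that the classifier was trained on.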

Summing up:

Case with an estimator in the last step (you can only call .fit() on the pipeline): model.fit(X_train, y_train) means the following:
```
final_estimator.fit(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)
```
which in your case becomes
```
gbc.fit(k_best.fit_transform(vt.fit_transform(X_train, y_train), y_train), y_train)
```
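This equivalence can be checked empirically. The sketch below fits the pipeline and, separately, chains the same three steps by hand, then compares the predictions on the test set. The variable names (`pipe`, `manual_pred`, `identical`) are illustrative, and the comparison relies on `random_state=0` making the classifier deterministic.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Pipeline version: a single call to fit.
pipe = Pipeline([
    ("vt", VarianceThreshold(0.01)),
    ("kbest", SelectKBest(chi2, k=5)),
    ("gbc", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual version: fit_transform on each selector, fit on the classifier.
vt = VarianceThreshold(0.01)
k_best = SelectKBest(chi2, k=5)
gbc = GradientBoostingClassifier(random_state=0)

X_vt = vt.fit_transform(X_train, y_train)
X_kb = k_best.fit_transform(X_vt, y_train)
gbc.fit(X_kb, y_train)

# At prediction time only transform (not fit_transform) is applied.
manual_pred = gbc.predict(k_best.transform(vt.transform(X_test)))

identical = np.array_equal(pipe_pred, manual_pred)
print(identical)
```

Matching predictions indicate that the pipeline's fit ran the selectors on the training data exactly as the hand-written chain does.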
Case with a transformer in the last step (you can call either .fit() or .fit_transform() on the pipeline, but let's suppose you're calling .fit_transform()): model.fit_transform(X_train, y_train) means the following:
```
final_estimator.fit_transform(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)
```
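For the transformer-only case, a similar check can be made with the feature selection pipeline from the question. This is a minimal sketch; `returned`, `X_pipe`, `X_manual` and `same_output` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

fs_pipeline = Pipeline([
    ("vt", VarianceThreshold(0.01)),
    ("kbest", SelectKBest(chi2, k=5)),
])

# fit returns the (fitted) pipeline object itself.
returned = fs_pipeline.fit(X_train, y_train)

# fit_transform returns the transformed training data.
X_pipe = fs_pipeline.fit_transform(X_train, y_train)

# The same two steps chained by hand.
vt = VarianceThreshold(0.01)
k_best = SelectKBest(chi2, k=5)
X_manual = k_best.fit_transform(vt.fit_transform(X_train, y_train), y_train)

same_output = np.array_equal(X_pipe, X_manual)
print(returned is fs_pipeline, same_output, X_pipe.shape)
```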

Finally, here's the reference in the source code: https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/pipeline.py#L351

Solution 1
1 Accepted 2022-07-25 09:18:42
