
How can I be sure that an sklearn pipeline applies the fit_transform method when it combines feature selection and an ML model?

Assume that I want to apply several feature selection methods using sklearn pipeline. An example is provided below:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

fs_pipeline = Pipeline([('vt', VarianceThreshold(0.01)),
                        ('kbest', SelectKBest(chi2, k=5)),
                        ])

X_new = fs_pipeline.fit_transform(X_train, y_train)

I get the selected features using the fit_transform method. If I call the fit method on the pipeline instead, I get the fitted pipeline object back.

Now, assume that I want to add an ML model to the pipeline like below:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])


model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

If I use the fit_transform method in the above code (model.fit_transform(X_train, y_train)), I get this error:

AttributeError: 'GradientBoostingClassifier' object has no attribute 'transform'

So I should use model.fit(X_train, y_train). But how can I be sure that the pipeline applied the fit_transform method for the feature selection steps?

A pipeline is meant for sequential data transformation (for which it needs multiple calls to .fit_transform()). You can be sure that .fit_transform() is called on the intermediate steps (that is, on every step but the last one) of a pipeline, because that is how it works by design.

Namely, when you call .fit() or .fit_transform() on a Pipeline instance, .fit_transform() is called in order on every step except the last one, and the output of each call is passed as input to the next step (y is passed along unchanged to every step). On the very last step, either .fit() or .fit_transform() is called, depending on which method you called on the pipeline itself. In practice the last step is more often an estimator that predicts than a transformer (as with your GradientBoostingClassifier).

Whenever the last step is an estimator rather than a transformer, as in your case, you cannot call .fit_transform() on the pipeline instance: the pipeline exposes the same methods as its final step, and an estimator such as GradientBoostingClassifier exposes neither .transform() nor .fit_transform().
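One way to observe this directly is to log the calls. The sketch below is only an illustration: LoggingVarianceThreshold, LoggingSelectKBest and the calls list are hypothetical helpers that are not part of scikit-learn; they subclass the real selectors and record each fit_transform call before delegating to the parent class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

calls = []  # hypothetical module-level log of fit_transform calls


class LoggingVarianceThreshold(VarianceThreshold):
    """Hypothetical helper: records every fit_transform call."""

    def fit_transform(self, X, y=None, **fit_params):
        calls.append("vt.fit_transform")
        return super().fit_transform(X, y, **fit_params)


class LoggingSelectKBest(SelectKBest):
    """Hypothetical helper: records every fit_transform call."""

    def fit_transform(self, X, y=None, **fit_params):
        calls.append("kbest.fit_transform")
        return super().fit_transform(X, y, **fit_params)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = Pipeline([
    ("vt", LoggingVarianceThreshold(0.01)),
    ("kbest", LoggingSelectKBest(chi2, k=5)),
    ("gbc", GradientBoostingClassifier(random_state=0)),
])

# Fitting the pipeline should trigger fit_transform on both selectors, in order.
model.fit(X_train, y_train)
calls_after_fit = list(calls)
print(calls_after_fit)

# Predicting only transforms with the already fitted selectors; nothing is refit.
model.predict(X_test)
calls_after_predict = list(calls)
print(calls_after_predict)
```

The log after predict is the same as the log after fit, which shows that the selectors are fitted once on the training data and only applied afterwards.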
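A quick way to see which methods a given pipeline exposes is to check with hasattr. This is a minimal sketch; clf_pipeline and fs_pipeline are illustrative names, and the check deliberately uses .transform() and .predict(), since how .fit_transform() itself is exposed may differ between scikit-learn versions.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.pipeline import Pipeline

# Pipeline ending in a classifier: it can predict but cannot transform.
clf_pipeline = Pipeline([
    ("vt", VarianceThreshold(0.01)),
    ("kbest", SelectKBest(chi2, k=5)),
    ("gbc", GradientBoostingClassifier(random_state=0)),
])

# Pipeline ending in a transformer: it can transform but cannot predict.
fs_pipeline = Pipeline([
    ("vt", VarianceThreshold(0.01)),
    ("kbest", SelectKBest(chi2, k=5)),
])

clf_can_predict = hasattr(clf_pipeline, "predict")
clf_can_transform = hasattr(clf_pipeline, "transform")
fs_can_predict = hasattr(fs_pipeline, "predict")
fs_can_transform = hasattr(fs_pipeline, "transform")

print("classifier pipeline:", clf_can_predict, clf_can_transform)
print("selector pipeline:  ", fs_can_predict, fs_can_transform)
```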

Summing up,

  • case with an estimator in the last step (you can only call .fit() on the pipeline); model.fit(X_train, y_train) means the following:

     final_estimator.fit(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)

    which in your case becomes

     gbc.fit(kbest.fit_transform(vt.fit_transform(X_train, y_train), y_train), y_train)

  • case with a transformer in the last step (you can call either .fit() or .fit_transform() on the pipeline, but let's suppose you're calling .fit_transform()); model.fit_transform(X_train, y_train) means the following:

     final_transformer.fit_transform(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)
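The equivalence summarized above can be checked empirically. The sketch below fits the pipeline from the question, repeats the same steps by hand, and compares the predictions; it also inspects the fitted selection steps through named_steps and get_support(). The variable names (manual_pred, same_predictions, and so on) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The pipeline from the question.
model = Pipeline([
    ("vt", VarianceThreshold(0.01)),
    ("kbest", SelectKBest(chi2, k=5)),
    ("gbc", GradientBoostingClassifier(random_state=0)),
])
model.fit(X_train, y_train)
pipeline_pred = model.predict(X_test)

# The same steps chained by hand: fit_transform on each selector, fit on the model.
vt = VarianceThreshold(0.01)
kbest = SelectKBest(chi2, k=5)
gbc = GradientBoostingClassifier(random_state=0)
Xt = vt.fit_transform(X_train, y_train)
Xt = kbest.fit_transform(Xt, y_train)
gbc.fit(Xt, y_train)
# At prediction time only transform is applied to the test data.
manual_pred = gbc.predict(kbest.transform(vt.transform(X_test)))

same_predictions = np.array_equal(pipeline_pred, manual_pred)
print("identical predictions:", same_predictions)

# The fitted selectors inside the pipeline can be inspected directly.
kept_by_vt = model.named_steps["vt"].get_support()        # mask over the 30 inputs
kept_by_kbest = model.named_steps["kbest"].get_support()  # mask over vt's outputs
print("features kept by vt:", int(kept_by_vt.sum()))
print("features kept by kbest:", int(kept_by_kbest.sum()))

# Slicing off the final step gives a transformer-only pipeline that is already fitted.
X_test_selected = model[:-1].transform(X_test)
print("shape seen by the classifier:", X_test_selected.shape)
```

If the feature selection steps had not been fitted and applied, get_support() would raise a not-fitted error and the classifier would not have been trained on 5 columns.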

Finally, here is the reference in the source code: https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/pipeline.py#L351
