How can I be sure that an sklearn pipeline applies the fit_transform method when using feature selection and an ML model in the pipeline?
Assume that I want to apply several feature selection methods using an sklearn pipeline. An example is provided below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
fs_pipeline = Pipeline([
    ('vt', VarianceThreshold(0.01)),
    ('kbest', SelectKBest(chi2, k=5)),
])
X_new = fs_pipeline.fit_transform(X_train, y_train)
I get the selected features using the fit_transform method. If I use the fit method on the pipeline, I get the pipeline object back.
Now, assume that I want to add an ML model to the pipeline, as below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = Pipeline([
    ('vt', VarianceThreshold(0.01)),
    ('kbest', SelectKBest(chi2, k=5)),
    ('gbc', GradientBoostingClassifier(random_state=0)),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
If I use the fit_transform method in the above code (model.fit_transform(X_train, y_train)), I get the error:
AttributeError: 'GradientBoostingClassifier' object has no attribute 'transform'
So I should use model.fit(X_train, y_train). But how can I be sure that the pipeline applied the fit_transform method for the feature selection steps?
A pipeline is meant for sequential data transformation, which requires multiple calls to .fit_transform(). You can be sure that .fit_transform() is called on the intermediate steps (that is, on every step except the last one) of a pipeline, because that is how it works by design.
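One way to check this empirically is to inspect the fitted attributes of each step after calling model.fit(). The sketch below reuses the pipeline from the question. It assumes that variances_, scores_, get_support() and n_features_in_ are available in your scikit-learn version (n_features_in_ needs a reasonably recent release).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = Pipeline([
    ('vt', VarianceThreshold(0.01)),
    ('kbest', SelectKBest(chi2, k=5)),
    ('gbc', GradientBoostingClassifier(random_state=0)),
])
model.fit(X_train, y_train)

# Each step is fitted in place, so its learned attributes can be inspected.
vt = model.named_steps['vt']
kbest = model.named_steps['kbest']
gbc = model.named_steps['gbc']

n_after_vt = int(vt.get_support().sum())
print("features in:            ", X_train.shape[1])
print("features kept by vt:    ", n_after_vt)
print("features scored by kbest:", kbest.scores_.shape[0])
print("features kept by kbest: ", int(kbest.get_support().sum()))
print("features seen by gbc:   ", gbc.n_features_in_)
```

If kbest scored exactly as many features as vt kept, and gbc saw exactly the five features that kbest kept, then each step was fitted on the transformed output of the step before it.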
Namely, when calling .fit() or .fit_transform() on a Pipeline instance, .fit_transform() is called sequentially on every transformer except the last step, and the output of each call is passed as input to the next one. On the very last step, either .fit() or .fit_transform() is called, depending on the method called on the pipeline itself. Indeed, the last step is more commonly an estimator than a transformer (as with your GradientBoostingClassifier).
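If you want to see the calls themselves, a small experiment is to put a transformer that logs its own method calls into a pipeline. RecordingTransformer and CALLS below are hypothetical helpers written for this illustration, not part of scikit-learn, and the toy data is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

CALLS = []  # module-level log, shared by all RecordingTransformer instances


class RecordingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through transformer that records which method is called."""

    def __init__(self, name="step"):
        self.name = name

    def fit(self, X, y=None):
        CALLS.append((self.name, "fit"))
        return self

    def transform(self, X):
        CALLS.append((self.name, "transform"))
        return X

    def fit_transform(self, X, y=None, **fit_params):
        CALLS.append((self.name, "fit_transform"))
        return X


# Toy data, only there to make the pipeline fit-able.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ("first", RecordingTransformer("first")),
    ("second", RecordingTransformer("second")),
    ("clf", LogisticRegression()),
])

pipe.fit(X, y)
fit_calls = list(CALLS)
print(fit_calls)

CALLS.clear()
pipe.predict(X)
predict_calls = list(CALLS)
print(predict_calls)
```

During pipe.fit() only fit_transform is logged for the two intermediate steps, while pipe.predict() logs plain transform calls, since nothing is refitted at prediction time.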
Whenever the last step is an estimator rather than a transformer, as in your case, you won't be able to call .fit_transform() on the pipeline instance: the pipeline exposes the same methods as its final step, and estimators such as GradientBoostingClassifier expose neither .transform() nor .fit_transform().
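This can be reproduced in a few lines. The sketch below only assumes that an AttributeError is raised; the exact message depends on the scikit-learn version (older releases complain about the final estimator lacking transform, while newer ones report that the Pipeline itself has no fit_transform).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = Pipeline([
    ('vt', VarianceThreshold(0.01)),
    ('kbest', SelectKBest(chi2, k=5)),
    ('gbc', GradientBoostingClassifier(random_state=0)),
])

# The final step is a classifier, so the pipeline cannot act as a transformer.
try:
    model.fit_transform(X_train, y_train)
    raised = False
except AttributeError as exc:
    raised = True
    print(type(exc).__name__, ":", exc)

# Fitting and predicting are still available.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```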
Summing up:
Case with an estimator in the last step (you can only call .fit() on the pipeline): model.fit(X_train, y_train) means the following:
final_estimator.fit(transformer_n.fit_transform(transformer_n_minus_1.fit_transform(...transformer0.fit_transform(X_train, y_train))))
which in your case becomes
gbc.fit(k_best.fit_transform(vt.fit_transform(X_train, y_train)))
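To convince yourself of this equivalence, you can chain the steps by hand and compare the result with the pipeline. This is a minimal sketch that reuses the question's data and step names; note that y_train is passed to every call, which the one-line expression above leaves out for brevity. clone() is used so the manual steps start unfitted.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = Pipeline([
    ('vt', VarianceThreshold(0.01)),
    ('kbest', SelectKBest(chi2, k=5)),
    ('gbc', GradientBoostingClassifier(random_state=0)),
])
model.fit(X_train, y_train)

# Manual chain with fresh, unfitted copies of the same steps.
vt = clone(model.named_steps['vt'])
kbest = clone(model.named_steps['kbest'])
gbc = clone(model.named_steps['gbc'])

Xt = vt.fit_transform(X_train, y_train)
Xt = kbest.fit_transform(Xt, y_train)
gbc.fit(Xt, y_train)

# At prediction time only transform() is applied, never fit_transform().
manual_pred = gbc.predict(kbest.transform(vt.transform(X_test)))
pipeline_pred = model.predict(X_test)

same_predictions = np.array_equal(manual_pred, pipeline_pred)
print("identical predictions:", same_predictions)
```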
Case with a transformer in the last step (you can call either .fit() or .fit_transform() on the pipeline; let's suppose you're calling .fit_transform()): model.fit_transform(X_train, y_train) means the following:
final_estimator.fit_transform(transformer_n.fit_transform(transformer_n_minus_1.fit_transform(...transformer0.fit_transform(X_train, y_train))))
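The same kind of check works for a pipeline that ends in a transformer, such as the feature-selection-only pipeline from the question. A sketch under the same assumptions as before:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

fs_pipeline = Pipeline([
    ('vt', VarianceThreshold(0.01)),
    ('kbest', SelectKBest(chi2, k=5)),
])
X_new = fs_pipeline.fit_transform(X_train, y_train)

# Same two steps, chained by hand.
vt = VarianceThreshold(0.01)
kbest = SelectKBest(chi2, k=5)
X_manual = kbest.fit_transform(vt.fit_transform(X_train, y_train), y_train)

same_features = np.array_equal(X_new, X_manual)
print("identical selected features:", same_features)

# The fitted pipeline can then transform unseen data without refitting.
X_test_new = fs_pipeline.transform(X_test)
```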
Finally, here's the reference in the source code: https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/pipeline.py#L351