
Using scaler in Sklearn Pipeline and Cross validation

I previously saw a post with code like this:

from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score

scaler = StandardScaler()
clf = svm.LinearSVC()

pipeline = Pipeline([('transformer', scaler), ('estimator', clf)])

cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv=cv)

My understanding is that when we apply the scaler, we should use 3 of the 4 folds to calculate the mean and standard deviation, and then apply that mean and standard deviation to all 4 folds.

In the above code, how can I know that sklearn is following the same strategy? If sklearn is not following it, that would mean it calculates the mean/std from all 4 folds. Would that mean I should not use the above code?

I do like the above code because it saves tons of time.

In the example you gave, I would add an additional step using sklearn.model_selection.train_test_split:

from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score, train_test_split

folds = 4

# Hold out one "fold" worth of data as a final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=(1 / folds), random_state=0, stratify=y)

scaler = StandardScaler()
clf = svm.LinearSVC()

pipeline = Pipeline([('transformer', scaler), ('estimator', clf)])

# Cross-validate on the training portion only
cv = KFold(n_splits=(folds - 1))
scores = cross_val_score(pipeline, X_train, y_train, cv=cv)

I think best practice is to use only the training data set (i.e., X_train, y_train) when tuning the hyperparameters of your model, and the test data set (i.e., X_test, y_test) should be used as a final check, to make sure your model isn't biased towards the validation folds. At that point you would apply the same scaler that you fit on your training data set to your testing data set.
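As a minimal sketch of that final check (assuming the pipeline and the X_train, X_test, y_train, y_test from the snippet above), fitting the pipeline on the training data and scoring on the test set reuses the scaler statistics learned from X_train:

# Final check: the scaler inside the pipeline is fitted on X_train only,
# and the same fitted scaler is then applied to X_test by score()
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))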

Yes, this is done properly; this is one of the reasons for using pipelines: all the preprocessing is fitted only on the training folds.
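If you want to check this empirically, one way (a sketch, assuming scikit-learn >= 0.20 and the X, y from the question as NumPy arrays) is to have cross_validate return the fitted pipelines and compare each fold's scaler statistics with the mean of the full data set:

from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_validate

pipeline = Pipeline([('transformer', StandardScaler()), ('estimator', svm.LinearSVC())])
result = cross_validate(pipeline, X, y, cv=KFold(n_splits=4), return_estimator=True)

for fitted in result['estimator']:
    # mean_ is computed from that split's training folds only,
    # so it differs slightly from the mean of the full X
    print(fitted.named_steps['transformer'].mean_, X.mean(axis=0))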


Some references:

Section 6.1.1 of the User Guide:

Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

The note at the end of section 3.1.1 of the User Guide:

Data transformation with held-out data
Just as it is important to test a predictor on data held out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction:
...code sample...
A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:
...
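The pattern that note describes looks roughly like this (my own sketch, not the Guide's exact snippet): the transformer is fitted on the training portion only and then applied to the held-out portion, and wrapping the same steps in a Pipeline gives you that behaviour automatically inside cross_val_score:

from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline

X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, random_state=0)

# Manual version: scaling statistics are learned from the training part only
scaler = StandardScaler().fit(X_tr)
clf = svm.LinearSVC().fit(scaler.transform(X_tr), y_tr)
print(clf.score(scaler.transform(X_ho), y_ho))

# Pipeline version: the same discipline is applied in every CV split
print(cross_val_score(make_pipeline(StandardScaler(), svm.LinearSVC()), X, y, cv=4))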

Finally, you can look into the source for cross_val_score. It calls cross_validate, which clones and fits the estimator (in this case, the entire pipeline) on each training split. GitHub link.
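In essence, for each split cross_validate does something equivalent to this simplified sketch (not the actual source; assumes X and y are NumPy arrays and pipeline is defined as above):

from sklearn.base import clone
from sklearn.model_selection import KFold

fold_scores = []
for train_idx, test_idx in KFold(n_splits=4).split(X, y):
    est = clone(pipeline)                   # fresh, unfitted copy of the whole pipeline
    est.fit(X[train_idx], y[train_idx])     # scaler + SVC are fitted on the training folds only
    fold_scores.append(est.score(X[test_idx], y[test_idx]))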
