
Using cross-validation to determine weights of machine learning algorithms (GridSearchCV, RidgeCV, StackingClassifier)

My question has to do with GridSearchCV, RidgeCV, and StackingClassifier/Regressor.

  1. StackingClassifier/Regressor - AFAIK, it first trains each base estimator individually on the whole train set. Then it uses a cross-validation scheme, using the predictions of each base estimator as the new features to train the final estimator. From the documentation: "To generalize and avoid over-fitting, the final_estimator is trained on out-samples using sklearn.model_selection.cross_val_predict internally."

My question is, what exactly does this mean? Does it break the train data into k folds, and then for each fold train the final estimator on the training part of the fold, test it on the held-out part, and then take the final estimator's weights from the fold with the best score? Or what?
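For concreteness, here is a minimal sketch of the mechanism described in the docs as I understand it; the toy data, base estimator, final estimator, and predict_proba stacking are just illustrative assumptions, not sklearn's actual source:

```python
# Rough sketch of the mechanism described above (not sklearn's actual source).
# The toy data, base estimator, and final estimator are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)
base = RandomForestClassifier(random_state=0)

# 1) Each base estimator is fitted on the whole training set (these are the
#    fitted estimators the stacker exposes afterwards).
base.fit(X, y)

# 2) Separately, cross_val_predict refits clones of the base estimator on each
#    training fold and predicts the corresponding held-out fold, so every row
#    of meta_features is an out-of-sample prediction.
meta_features = cross_val_predict(base, X, y, cv=5, method="predict_proba")

# 3) The final estimator is then fitted on these out-of-fold predictions.
final_estimator = LogisticRegression()
final_estimator.fit(meta_features, y)
```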

  2. I think I can group GridSearchCV and RidgeCV into the same question, as they are quite similar (albeit RidgeCV uses efficient leave-one-out CV by default).

To find the best hyperparameters, do they run CV over all the folds for each hyperparameter setting, find the setting with the best average score, AND THEN, AFTER finding the best hyperparameters, train the model with those hyperparameters on the WHOLE training set? Or am I looking at it wrong?
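For example, this is roughly what I would expect RidgeCV to do; the alphas and toy regression data below are made-up assumptions:

```python
# Rough sketch: RidgeCV picks alpha by (efficient leave-one-out) CV and then
# refits on the whole training set. Alphas and toy data are assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])   # cv=None -> efficient LOO CV
ridge_cv.fit(X, y)
print("alpha chosen by CV:", ridge_cv.alpha_)

# The exposed coefficients come from a fit on ALL of X, y with the chosen
# alpha, not from any single CV fold:
refit = Ridge(alpha=ridge_cv.alpha_).fit(X, y)
print("max coef difference vs. refit on whole set:",
      np.max(np.abs(ridge_cv.coef_ - refit.coef_)))
```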

If anyone could shed some light on this, that would be great. Thanks!

You're exactly right. The process looks like this:

  1. Select the first set of hyperparameters
  2. Partition the data into k folds
  3. Run the model on each fold
  4. Obtain the average score (loss, R², or whatever criterion was specified)
  5. Repeat steps 2-4 for all other sets of hyperparameters
  6. Choose the set of hyperparameters with the best score
  7. Retrain the model on the entire dataset (as opposed to a single fold) using the best hyperparameters
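As a rough illustration, steps 1-7 can be written out by hand like this (the dataset, model, and parameter grid are illustrative assumptions); GridSearchCV with its default refit=True does the same thing in one call:

```python
# Hand-rolled version of the steps above, plus the equivalent GridSearchCV call.
# Dataset, model, and grid are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Steps 1-6: score each hyperparameter setting with k-fold CV and keep the
# setting with the best average score.
best_score, best_C = -float("inf"), None
for C in param_grid["C"]:
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5)
    if scores.mean() > best_score:
        best_score, best_C = scores.mean(), C

# Step 7: refit on the entire training set with the winning hyperparameters.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X, y)

# GridSearchCV automates this; best_estimator_ is the model refit on all of
# X, y because refit=True by default.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5).fit(X, y)
print(best_C, search.best_params_)
```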
