
Understanding Cross Validation for Machine Learning

Is the following correct about cross validation?

The training data is divided into different groups; all but one of the groups are used for training the model. Once the model is trained, the 'left out' training data is used to perform hyperparameter tuning. Once the optimal hyperparameters have been chosen, the test data is applied to the model to give a result, which is then compared to other models that have undergone a similar process but with different combinations of training data sets. The model with the best results on the test data is then chosen.


I don't think it is correct. You wrote:

Once the model is trained the 'left out' training data is used to perform hyperparameter tuning

You tune the model by picking (manually, or using a method like grid search or random search) a set of the model's hyperparameters: parameters whose values are set by you before you even fit the model to the data. Then, for a selected set of hyperparameter values, you calculate the validation set error using cross-validation.

So it should be like this:

The training data is divided into different groups; all but one of the groups are used for training the model. Once the model is trained, the 'left out' training data is used to...

... calculate the error. At the end of the cross-validation, you will have k errors, one computed on each of the k left-out sets. What you do next is compute the mean of these k errors, which gives you a single value: the validation set error.
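The loop described above can be sketched as follows. This is a minimal illustration, not the asker's actual setup: the model choice (`Ridge`), the synthetic regression data, and k = 5 are all assumptions made just to show the mechanics of averaging the k per-fold errors.

```python
# Minimal sketch of k-fold cross-validation: train on k-1 folds,
# compute the error on the left-out fold, then average the k errors.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Illustrative synthetic data (assumption, not from the question)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])       # train on k-1 folds
    preds = model.predict(X[val_idx])           # predict on the left-out fold
    fold_errors.append(mean_squared_error(y[val_idx], preds))

# The mean of the k fold errors is the single validation set error
validation_error = np.mean(fold_errors)
print(len(fold_errors), validation_error)
```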

If you have n sets of hyperparameters, you simply repeat the procedure n times, which gives you n validation set errors. You then pick the set that gave you the smallest validation error.
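The "repeat n times and pick the best" step might look like the sketch below, again with an assumed model (`Ridge`) and an assumed candidate grid; `cross_val_score` is one convenient way to get the k fold scores in a single call.

```python
# Hedged sketch: repeat k-fold CV for each candidate hyperparameter value
# and keep the one with the smallest mean validation error.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

candidate_alphas = [0.01, 0.1, 1.0, 10.0]   # n = 4 hyperparameter settings (assumption)
results = {}
for alpha in candidate_alphas:
    # cross_val_score returns one score per fold; negate MSE to get an error
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=kf,
                             scoring="neg_mean_squared_error")
    results[alpha] = -scores.mean()          # validation set error for this alpha

best_alpha = min(results, key=results.get)   # smallest validation error wins
print(best_alpha, results[best_alpha])
```

`GridSearchCV` automates exactly this loop, but the manual version makes the "n validation set errors" from the answer explicit.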

At the end, you will typically calculate the test set error to see how the model performs on unseen data, which simulates putting the model into production, and to check whether there is a difference between the test set error and the validation set error. A significant difference indicates over-fitting.
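The final check can be sketched like this: hold out a test set that cross-validation never sees, refit on all the training data with the chosen hyperparameters, and measure the test error once. The `best_alpha` value here is a placeholder standing in for whatever CV selected.

```python
# Sketch of the final step: after choosing hyperparameters via CV on the
# training set, fit once on all training data and measure the test set error.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=250, n_features=5, noise=10.0, random_state=0)
# Hold out a test set that cross-validation never touches
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

best_alpha = 1.0                 # placeholder: assume CV selected this value
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)

test_error = mean_squared_error(y_test, final_model.predict(X_test))
print(test_error)                # compare against the CV validation error
```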

Just to add something on cross-validation itself: the reason we use k-fold CV or LOOCV is that it provides a good estimate of the test set error. This means that when I tweak the hyperparameters and the validation set error drops, I know that I have genuinely improved the model, rather than getting lucky by simply fitting the training set better.

