
Understanding Cross Validation for Machine Learning

Is the following correct about cross validation?

The training data is divided into different groups; all but one of the groups are used for training the model. Once the model is trained, the 'left out' training data is used to perform hyperparameter tuning. Once the optimal hyperparameters have been chosen, the test data is applied to the model to give a result, which is then compared to other models that have undergone a similar process but with different combinations of training data sets. The model with the best results on the test data is then chosen.


I don't think it is correct. You wrote:

Once the model is trained the 'left out' training data is used to perform hyperparameter tuning

You tune the model by picking (manually, or using a method like grid search or random search) a set of the model's hyperparameters: parameters whose values are set by you before you even fit the model to the data. Then, for a selected set of hyperparameter values, you calculate the validation set error using cross-validation.

So it should be like this:

The training data is divided into different groups; all but one of the groups are used for training the model. Once the model is trained, the 'left out' training data is used to...

... calculate the error. At the end of the cross-validation, you will have k errors, one computed on each of the k left-out sets. What you do next is compute the mean of these k errors, which gives you a single value: the validation set error.
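The loop described above can be sketched as follows. This is a minimal illustration, not the asker's actual setup: the model choice (`Ridge`), the synthetic regression data, and k = 5 are all assumptions made just to show the mechanics of averaging the k per-fold errors.

```python
# Minimal sketch of k-fold cross-validation: train on k-1 folds,
# compute the error on the left-out fold, then average the k errors.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Illustrative synthetic data (assumption, not from the question)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])       # train on k-1 folds
    preds = model.predict(X[val_idx])           # predict on the left-out fold
    fold_errors.append(mean_squared_error(y[val_idx], preds))

# The mean of the k fold errors is the single validation set error
validation_error = np.mean(fold_errors)
print(len(fold_errors), validation_error)
```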

If you have n sets of hyperparameters, you simply repeat the procedure n times, which gives you n validation set errors. You then pick the set that gave you the smallest validation error.
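The "repeat n times and pick the best" step might look like the sketch below, again with an assumed model (`Ridge`) and an assumed candidate grid; `cross_val_score` is one convenient way to get the k fold scores in a single call.

```python
# Hedged sketch: repeat k-fold CV for each candidate hyperparameter value
# and keep the one with the smallest mean validation error.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

candidate_alphas = [0.01, 0.1, 1.0, 10.0]   # n = 4 hyperparameter settings (assumption)
results = {}
for alpha in candidate_alphas:
    # cross_val_score returns one score per fold; negate MSE to get an error
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=kf,
                             scoring="neg_mean_squared_error")
    results[alpha] = -scores.mean()          # validation set error for this alpha

best_alpha = min(results, key=results.get)   # smallest validation error wins
print(best_alpha, results[best_alpha])
```

`GridSearchCV` automates exactly this loop, but the manual version makes the "n validation set errors" from the answer explicit.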

At the end, you will typically calculate the test set error to see how the model performs on unseen data, which simulates putting the model into production, and to check whether there is a difference between the test set error and the validation set error. A significant difference indicates over-fitting.
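The final check can be sketched like this: hold out a test set that cross-validation never sees, refit on all the training data with the chosen hyperparameters, and measure the test error once. The `best_alpha` value here is a placeholder standing in for whatever CV selected.

```python
# Sketch of the final step: after choosing hyperparameters via CV on the
# training set, fit once on all training data and measure the test set error.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=250, n_features=5, noise=10.0, random_state=0)
# Hold out a test set that cross-validation never touches
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

best_alpha = 1.0                 # placeholder: assume CV selected this value
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)

test_error = mean_squared_error(y_test, final_model.predict(X_test))
print(test_error)                # compare against the CV validation error
```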

Just to add something on cross-validation itself: the reason we use k-fold CV or LOOCV is that it provides a good estimate of the test set error. This means that when I tweak the hyperparameters and the validation set error drops, I know that I have genuinely improved the model, rather than getting lucky by simply fitting the training set better.

