
How to check machine learning accuracy without cross validation

I have training samples X_train and Y_train to train on, and a separate set X_estimated. My task is to make my classifier learn as accurately as it can, and then predict a vector of results over X_estimated that comes as close as possible to Y_estimated (which I do have, so I can measure how precise the predictions are). If I split my training data roughly 75/25 to train and test it, I can get accuracy using sklearn.metrics.accuracy_score and a confusion matrix. But then I am losing that 25% of samples, which would otherwise make my predictions more accurate.
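
For reference, the split-based evaluation described here looks roughly like this (a sketch only; X_train and Y_train are the variables above, the rest is illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# hold out 25% of the training data for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(X_train, Y_train, test_size=0.25)
clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(accuracy_score(y_te, pred))
print(confusion_matrix(y_te, pred))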

Is there any way I could train on 100% of the data and still be able to see an accuracy score (or percentage), so I can run the prediction many times and save the best result? I am using a random forest with 500 estimators and usually get about 90% accuracy. I want to save the best possible prediction vector for my task without splitting off any data (not wasting anything), but still be able to calculate accuracy across multiple attempts (random forest gives different results each run), so I can keep the best prediction vector.

Thank you

Splitting your data is critical for evaluation. There is no way you could train your model on 100% of the data and still get a correct evaluation accuracy, unless you expand your dataset. I mean, you could change your train/test split, or try to optimize your model in other ways, but I guess the simple answer to your question would be no.

As per your requirement, you can try K Fold Cross Validation, splitting your data 90|10 for Train|Test.
Using 100% of the data for training is not possible, because you have to test on something before you can validate how good your model is. K Fold CV does use your whole train set: in each fold it holds out a different portion of the train data as that fold's test sample and trains on the rest.
Lastly, it averages the accuracy over all the folds. Then you can finally test the accuracy using the held-out 10% of the data. You can read more here and here.

K Fold Cross Validation

[Image: K-fold cross-validation diagram]

Sklearn provides simple methods for performing K fold cross validation; you simply pass the number of folds to the method. But remember: the more folds you use, the longer it takes to train the model. You can check more here, and see the sketch below.
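
A minimal sketch of the workflow described in this answer (X_train and Y_train are the question's variables; everything else is illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# 90|10 split: keep 10% aside for a final test
X_tr, X_te, y_tr, y_te = train_test_split(X_train, Y_train, test_size=0.10)
clf = RandomForestClassifier(n_estimators=500)

# 10-fold CV on the 90% train portion; the mean over folds is the CV accuracy
scores = cross_val_score(clf, X_tr, y_tr, cv=10, scoring="accuracy")
print(scores.mean())

# finally, fit on the full train portion and check the 10% hold-out
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))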

It is not necessary to do a 75|25 split of your data all the time. 75|25 is kind of old school now. It greatly depends on the amount of data you have. For example, if you have 1 billion sentences for training a language model, it is not necessary to reserve 25% for testing.

Also, I second the previous answer suggesting K-fold cross-validation. As a side note, you could consider looking at other metrics like precision and recall as well.
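
A minimal sketch of those extra metrics (y_te and pred are hypothetical true and predicted labels):

from sklearn.metrics import classification_report, precision_score, recall_score

print(precision_score(y_te, pred, average="macro"))
print(recall_score(y_te, pred, average="macro"))
print(classification_report(y_te, pred))  # per-class precision, recall, F1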

In general, splitting your data set is critical for evaluation, so I would recommend you always do that.

That said, there are methods that in some sense allow you to train on all your data and still get an estimate of your performance, i.e. an estimate of the generalization accuracy. One particularly prominent method is leveraging the out-of-bag samples of models based on bootstrapping, such as RandomForests.

from sklearn.ensemble import RandomForestClassifier

# bootstrap=True trains each tree on a sample drawn with replacement,
# so every tree leaves some rows out ("out-of-bag")
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True)
rf.fit(X, y)
# accuracy estimated on the rows each tree never saw during training
print(rf.oob_score_)

If you are doing classification, always go with stratified k-fold CV ( https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/ ). If you're doing regression, then go with simple k-fold CV, or you can divide the target into bins and do stratified k-fold CV, as in the sketch below. This way you can use your data completely in model training.
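
A sketch of both cases (X and y are assumed feature and target arrays; the quartile bin edges are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# classification: stratified folds preserve the class proportions in each fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(RandomForestClassifier(n_estimators=500), X, y, cv=skf).mean())

# regression: bin the continuous target at its quartiles, then stratify on the bins
y_bins = np.digitize(y, bins=np.quantile(y, [0.25, 0.5, 0.75]))
for train_idx, test_idx in skf.split(X, y_bins):
    pass  # fit and evaluate a regressor on X[train_idx] / X[test_idx]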
