
How to check machine learning accuracy without cross validation

I have training samples X_train and Y_train to train on, and a separate set X_estimated. My task is to make my classifier learn as accurately as it can, and then predict a vector of results over X_estimated that comes close to Y_estimated (which I have now; the predictions have to be as precise as possible). If I split my training data, say 75/25, into train and test sets, I can get the accuracy using sklearn.metrics.accuracy_score and a confusion matrix. But I am losing that 25% of the samples that would make my predictions more accurate.

Is there any way I could train on 100% of the data and still be able to see an accuracy score (or percentage), so I can predict many times and save the best result? I am using a random forest with 500 estimators and usually get around 90% accuracy. I want to save the best possible prediction vector for my task without splitting off any data (not wasting anything), but still be able to calculate accuracy from multiple attempts (random forest gives different results on every run) so I can keep the best prediction vector.

Thank you

Splitting your data is critical for evaluation. There is no way to train your model on 100% of the data and still get a correct evaluation accuracy unless you expand your dataset. You could change your train/test split, or try to optimize your model in other ways, but I guess the simple answer to your question is no.

As per your requirement, you can try k-fold cross validation with, say, a 90/10 train/test split.
Taking 100% of the data for training alone is not possible, because you have to test on something before you can tell how good your model is. K-fold CV takes your whole training set into consideration: in each fold it holds out a different subset as test data and trains on the rest, and the accuracy is the average over all the folds. Then, finally, you can test the model on the held-out 10% of the data.

[Figure: K Fold Cross Validation]

sklearn provides simple methods for performing k-fold cross validation; you just pass the number of folds. But remember: the more folds you use, the longer it takes to train the model. See the sketch below.
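A minimal sketch with sklearn's cross_val_score, assuming X_train and Y_train are the arrays from the question and a random forest like the asker's:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators=500)
# cross_val_score fits on k-1 folds and scores on the held-out fold,
# so every sample is used for both training and evaluation across the k runs.
scores = cross_val_score(clf, X_train, Y_train, cv=5)
print(scores)          # per-fold accuracy
print(scores.mean())   # averaged accuracy estimate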

It is not necessary to do a 75/25 split of your data all the time. 75/25 is kind of old school now; it greatly depends on the amount of data you have. For example, if you have 1 billion sentences for training a language model, it is not necessary to reserve 25% of them for testing.

Also, I second the previous answer suggesting k-fold cross-validation. As a side note, you could consider looking at other metrics, like precision and recall, as well; a small sketch follows.
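For example, a short sketch with sklearn's classification_report, which prints precision, recall, and F1 per class (the labels below are toy values purely for illustration):

from sklearn.metrics import classification_report

# Toy labels, purely for illustration.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
print(classification_report(y_true, y_pred))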

In general, splitting your data set is critical for evaluation, so I would recommend you always do that.

That said, there are methods that in some sense allow you to train on all your data and still get an estimate of your performance, i.e. of the generalization accuracy. One particularly prominent method is leveraging the out-of-bag samples of models based on bootstrapping, e.g. random forests.

from sklearn.ensemble import RandomForestClassifier

# oob_score=True scores each sample using only the trees whose
# bootstrap sample did not contain it, giving a built-in hold-out estimate.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True)
rf.fit(X, y)          # fit on the full training data (e.g. X_train, Y_train)
print(rf.oob_score_)  # out-of-bag accuracy estimate
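Because each tree sees a bootstrap sample, roughly a third of the rows are left out of any given tree; oob_score_ is the accuracy measured on those left-out rows, so you get a hold-out-style estimate while still fitting on 100% of the data.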

If you are doing classification, always go with stratified k-fold CV ( https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/ ). If you are doing regression, go with plain k-fold CV, or divide the target into bins and do stratified k-fold on those. This way you can use your data completely in model training.
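A minimal sketch of stratified k-fold with sklearn, again assuming X_train and Y_train are the arrays from the question:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = RandomForestClassifier(n_estimators=500)
# StratifiedKFold preserves the class ratio in every fold,
# which matters when the classes are imbalanced.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_train, Y_train, cv=cv)
print(scores.mean())  # stratified CV accuracy estimate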
