
Confusion around the SKLearn GridSearchCV scoring parameter and using train test split

Asked 2021-07-27 09:24:29. Tags: python, scikit-learn, rdkit

I'm a little confused about how GridSearchCV works together with a train/test split.
As far as I know, a paper that built models on the dataset I'm using evaluated them with ROC-AUC.
I'm trying to replicate what this paper did as closely as I can. From reading a few other posts here, I've gathered that running GridSearchCV on the entire dataset is prone to overfitting, so we should split the data into a training partition and a testing partition. Then we should run GridSearchCV on the training partition with whatever model and parameter grid we choose, fit it, and finally get a score using the test partition we set aside.

Where I'm confused is with GridSearchCV itself. As far as I understand, it gives us scores for each of the folds the data is split into while searching over the parameters, and with best_score_ we can pull out the best of these scores. I don't understand what the scores represent, or why you can pass in a scoring parameter to begin with, since the job of GridSearchCV is always to find the best possible parameters anyway. (Perhaps this is a poor assumption, but I'm assuming there is an objectively best set of parameters regardless of the scoring method.) My plan was to find the best parameters with GridSearchCV, use those parameters to fit a model, and then evaluate that model on the partition I saved for testing using the ROC-AUC metric.

So in the end, does it matter (if at all) which scoring method I pass into GridSearchCV, given that it will always try to return the best set of parameters anyway, which I will then use to compute my final score on the testing partition?

1 answer

This document may help.

There you can see that the scoring parameter accepts various metrics, such as roc_auc. The scikit-learn documentation lists all available metrics.
Optimizing over different metrics results in different optimal parameters. Just think about optimizing precision versus recall: optimizing precision leads to fewer false positives, while optimizing recall leads to fewer false negatives.
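As a rough illustration of that point, the sketch below runs the same grid search three times, changing only the scoring argument. The synthetic dataset, the LogisticRegression estimator, and the parameter grid are assumptions made for this example and are not from the question. The selected parameters may or may not differ on any given dataset; the point is only that the metric decides which candidate counts as "best".

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced binary problem (an assumption for illustration only).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Hypothetical grid: regularization strength and class weighting.
param_grid = {"C": [0.01, 1.0, 100.0], "class_weight": [None, "balanced"]}

best = {}
for metric in ("precision", "recall", "roc_auc"):
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid,
        scoring=metric,  # the only thing that changes between runs
        cv=5,
    )
    search.fit(X, y)
    # best_params_ is the candidate with the highest mean CV score for this metric.
    best[metric] = (search.best_params_, search.best_score_)
    print(metric, search.best_params_, round(search.best_score_, 3))
```

Each run ranks the same six candidates, but by a different mean cross-validated score, so there is no single "objectively best" parameter set independent of the metric.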
Also, in GridSearchCV, the CV stands for cross-validation. The splitting into training and validation folds happens inside this function, so it is taken care of. You only have to provide the splitter as an argument to GridSearchCV, for example cv=StratifiedKFold(n_splits=5, shuffle=True).
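Putting this together with the workflow described in the question, a minimal sketch might look like the following. The synthetic data, the RandomForestClassifier, and the grid values are assumptions for illustration; the paper's actual model and data are not known here. The search uses scoring="roc_auc" so that parameter selection and the final held-out evaluation use the same metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Placeholder data (assumption); substitute your own features and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out a test partition that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The splitter controls how the training partition is cut into folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 5]},  # hypothetical grid
    scoring="roc_auc",
    cv=cv,
)
search.fit(X_train, y_train)

# best_score_ is the mean cross-validated ROC-AUC of the winning parameter
# combination, averaged over the folds (not the single best fold).
print("best params:", search.best_params_)
print("mean CV ROC-AUC:", round(search.best_score_, 3))

# With the default refit=True, the search object holds a model refit on the
# whole training partition, so it can score the held-out test set directly.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("test ROC-AUC:", round(test_auc, 3))
```

The cross-validated score is what the search uses to pick parameters; the test ROC-AUC is the number to report, since the test partition played no part in choosing them.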



solution1 1 2021-07-27 09:47:58
