
GridSearchCV and KFold cross-validation

I was trying to understand sklearn's GridSearchCV. I have a few basic questions about the use of cross-validation in GridSearchCV, and about how I should then use GridSearchCV's recommendations.

Say I declare a GridSearchCV instance as below:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older scikit-learn versions

RFReg = RandomForestRegressor(random_state=1)

param_grid = {
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [4, 5, 6, 7, 8, 9, 10]
}

CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=10)
CV_rfc.fit(X_train, y_train)

I have the following questions:

  1. Say in the first iteration n_estimators = 100 and max_depth = 4 is selected for model building. Will the score for this model be chosen with the help of 10-fold cross-validation?

    • a. My understanding of the process is as follows:

      • 1. X_train and y_train will be split into 10 sets.
      • 2. The model will be trained on 9 sets and tested on the 1 remaining set, and its score will be stored in a list: say score_list.
      • 3. This process will be repeated 9 more times, and each of these 9 scores will be added to score_list, giving 10 scores in all.
      • 4. Finally, the average of score_list will be taken to give a final_score for the model with parameters n_estimators = 100 and max_depth = 4.

    • b. The above process will be repeated with all other possible combinations of n_estimators and max_depth, and each time we will get a final_score for that model.

    • c. The best model will be the model having the highest final_score, and we will get the corresponding best values of 'n_estimators' and 'max_depth' from CV_rfc.best_params_.

Is my understanding of GridSearchCV correct?
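As a sanity check on the steps described above, here is a minimal sketch (not part of the original question; it assumes the X_train and y_train from above) that replicates the scoring of one single parameter combination with 10-fold CV. The averaged value should closely mirror the corresponding entry in CV_rfc.cv_results_['mean_test_score']:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# One fixed combination, e.g. n_estimators = 100 and max_depth = 4
model = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=1)

# 10-fold CV: train on 9 folds, score on the remaining fold, 10 times in total
score_list = cross_val_score(model, X_train, y_train, cv=10)

# The average of the 10 fold scores is the final_score for this combination
final_score = np.mean(score_list)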

  2. Now say I get the best model parameters as {'max_depth': 10, 'n_estimators': 100}. I declare an instance of the model as below:

RFReg_best = RandomForestRegressor(n_estimators = 100, max_depth = 10, random_state = 1)

I now have two options, and I want to know which of them is correct.

a. Use cross-validation on the entire dataset to see how well the model is performing, as below:

import numpy as np
from sklearn.model_selection import cross_val_score

# 'mean_squared_error' in very old sklearn versions; scores come back negated
scores = cross_val_score(RFReg_best, X, y, cv=10, scoring='neg_mean_squared_error')
rm_score = -scores
rm_score = np.sqrt(rm_score)

b. Fit the model on X_train, y_train and then test it on X_test, y_test:

from sklearn.metrics import mean_squared_error

RFReg_best.fit(X_train, y_train)
y_pred = RFReg_best.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))

Or are both of them correct?

Regarding (1), your understanding is indeed correct; one wording detail to correct in principle is "better final_score" instead of "highest", as there are several performance metrics (everything measuring the error, such as MSE, MAE etc.) that are lower-the-better.
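For example, here is a minimal sketch of how this plays out in practice (the name CV_rfc_mse is illustrative; RFReg and param_grid are the ones from the question). scikit-learn handles error metrics through negated scorers such as 'neg_mean_squared_error', so "higher is better" still holds internally and best_score_ is simply the least negative value:

# Error metrics are passed as negated scorers, so GridSearchCV can still
# maximise; the lowest MSE corresponds to the highest best_score_.
CV_rfc_mse = GridSearchCV(estimator=RFReg, param_grid=param_grid,
                          cv=10, scoring='neg_mean_squared_error')
CV_rfc_mse.fit(X_train, y_train)
print(CV_rfc_mse.best_params_)
print(-CV_rfc_mse.best_score_)  # back to a positive MSE value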

Now, step (2) is more tricky; it requires taking a step back to check the whole procedure...

To start with, in general CV is used either for parameter tuning (your step 1) or for model assessment (i.e. what you are trying to do in step 2), which are indeed different things. Splitting your data from the very beginning into training & test sets as you have done here, and then sequentially performing step 1 (for parameter tuning) and step 2b (model assessment on unseen data) is arguably the most "correct" procedure in principle (as for the bias you note in the comment, this is something we have to live with, since by default all our fitted models are "biased" toward the data used for their training, and this cannot be avoided).
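Put into code, this "most correct" procedure looks roughly as follows (a sketch only, assuming X and y hold the complete dataset; the 0.2 test size and the variable names are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Split first, so the test set never enters the tuning procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Step 1: parameter tuning with 10-fold CV on the training part only
CV_rfc = GridSearchCV(RandomForestRegressor(random_state=1),
                      param_grid={'n_estimators': [100, 500, 1000, 1500],
                                  'max_depth': [4, 5, 6, 7, 8, 9, 10]},
                      cv=10)
CV_rfc.fit(X_train, y_train)

# Step 2b: model assessment on the held-out, unseen test set
y_pred = CV_rfc.best_estimator_.predict(X_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))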

Nevertheless, since early on, practitioners have been wondering if they can avoid "sacrificing" a part of their precious data only for testing (model assessment) purposes, and trying to see if they can actually skip the model assessment part (and the test set itself), using as model assessment the best results obtained from the parameter tuning procedure (your step 1). This is clearly cutting corners, but, as usual, the question is: how far off will the actual results be, and will they still be meaningful?

Again, in theory, what Vivek Kumar writes in his linked answer is correct:

If you use the whole data into GridSearchCV, then there would be leakage of test data into parameter tuning and then the final model may not perform that well on newer unseen data.

But here is a relevant excerpt of the (highly recommended) Applied Predictive Modeling book (p. 78):

[Image: excerpt from Applied Predictive Modeling, p. 78]

In short: if you use the whole X in step 1 and consider the results of the tuning as model assessment, there will indeed be a bias/leakage, but it is usually small, at least for moderately large training sets...
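In code, that corner-cutting version would look roughly like this (again a sketch; the name CV_all is illustrative, and X, y are assumed to be the full dataset): tune on all the data and report the best cross-validated score as the model assessment, accepting the small optimistic bias discussed above:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Parameter tuning on the whole dataset; the best CV score doubles as the
# (slightly optimistic) model assessment, with no separate test set.
CV_all = GridSearchCV(RandomForestRegressor(random_state=1),
                      param_grid={'n_estimators': [100, 500, 1000, 1500],
                                  'max_depth': [4, 5, 6, 7, 8, 9, 10]},
                      cv=10, scoring='neg_mean_squared_error')
CV_all.fit(X, y)

rmse_cv = np.sqrt(-CV_all.best_score_)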


Wrapping up:

  • The "most correct" procedure in theory is indeed the combination of your steps 1 and 2b
  • You can try to cut corners, using the whole training set X in step 1, and most probably you will still be within acceptable limits regarding your model assessment.
