I was trying to understand sklearn's GridSearchCV. I have a few basic questions about the use of cross-validation in GridSearchCV, and about how I should use GridSearchCV's recommendations afterwards.
Say I declare a GridSearchCV instance as below:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

RFReg = RandomForestRegressor(random_state = 1)
param_grid = {
'n_estimators': [100, 500, 1000, 1500],
'max_depth' : [4,5,6,7,8,9,10]
}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv= 10)
CV_rfc.fit(X_train, y_train)
I had the below questions:

1. Say in the first iteration n_estimators = 100 and max_depth = 4 are selected for model building. Will the score for this model be chosen with the help of 10-fold cross-validation?

a. My understanding of the process is as follows: X_train and y_train will be split into 10 folds. The model will be trained on 9 folds and scored on the remaining one, the score being appended to a score_list; repeating this over all folds gives 10 scores in score_list for n_estimators = 100 and max_depth = 4, and their average is the final_score for that combination.

b. The above process will be repeated with all other possible combinations of n_estimators and max_depth, and each time we will get a final_score for that model.

c. The best model will be the model having the highest final_score, and we will get the corresponding best values of 'n_estimators' and 'max_depth' via CV_rfc.best_params_.

Is my understanding of GridSearchCV correct?
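One way to check the understanding sketched in (a)–(c) is to inspect cv_results_, which records one mean test score per parameter combination (the "final_score" above, i.e. the average of the per-fold scores). A minimal sketch on synthetic data, with a deliberately tiny grid so it runs quickly:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy data purely for illustration
X, y = make_regression(n_samples=100, n_features=5, random_state=1)

# Tiny grid so the example is fast; a real grid would be larger
param_grid = {'n_estimators': [5, 10], 'max_depth': [2, 3]}
gs = GridSearchCV(RandomForestRegressor(random_state=1),
                  param_grid=param_grid, cv=3)
gs.fit(X, y)

# One mean_test_score per combination: the average of the fold scores
for params, mean_score in zip(gs.cv_results_['params'],
                              gs.cv_results_['mean_test_score']):
    print(params, round(mean_score, 3))

# The combination whose mean_test_score is best
print(gs.best_params_)
```

Here best_params_ is simply the entry of cv_results_['params'] with the best mean_test_score, matching step (c).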
2. Say the search returns CV_rfc.best_params_ as {'max_depth': 10, 'n_estimators': 100}. I then declare an instance of the model as below:

RFReg_best = RandomForestRegressor(n_estimators = 100, max_depth = 10, random_state = 1)

I now have two options, and I wanted to know which of them is correct:
a. Use cross-validation on the entire dataset to see how well the model is performing, as below:

from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(RFReg_best, X, y, cv = 10, scoring = 'neg_mean_squared_error')
rm_score = -scores
rm_score = np.sqrt(rm_score)
b. Fit the model on X_train, y_train and then test it on X_test, y_test:

from sklearn.metrics import mean_squared_error

RFReg_best.fit(X_train, y_train)
y_pred = RFReg_best.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))

Or are both of them correct?
Regarding (1), your understanding is indeed correct; one wording detail to correct in principle is "best final_score" instead of "highest", since there are several performance metrics (everything measuring the error, such as MSE, MAE, etc.) that are the-lower-the-better.
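This lower-is-better detail is also why sklearn's scorer strings follow a uniform greater-is-better convention: error metrics are negated, hence names like 'neg_mean_squared_error'. A minimal illustration of the sign convention, using sklearn's get_scorer on toy data where the fit is exact:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer

# Perfectly linear toy data, so the model's MSE is essentially zero
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
model = LinearRegression().fit(X, y)

# Scorers are always "greater is better": the MSE scorer returns -MSE
scorer = get_scorer('neg_mean_squared_error')
score = scorer(model, X, y)
print(score)  # ≈ 0.0 (i.e. -MSE; never positive)
```

So when GridSearchCV picks the highest mean_test_score with such a scorer, it is in effect picking the lowest error, consistent with "best" rather than "highest" in the raw-metric sense.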
Now, step (2) is more tricky; it requires taking a step back to check the whole procedure...
To start with, in general CV is used either for parameter tuning (your step 1) or for model assessment (ie what you are trying to do in step 2), which are different things indeed. Splitting from the very beginning your data into training & test sets as you have done here, and then sequentially performing the steps 1 (for parameter tuning) and 2b (model assessment in unseen data) is arguably the most "correct" procedure in principle (as for the bias you note in the comment, this is something we have to live with, since by default all our fitted models are "biased" toward the data used for their training, and this cannot be avoided).
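The "most correct" procedure just described (split first, tune with CV on the training part only, then assess once on the held-out test set) can be sketched as follows; synthetic data and a tiny grid are used here purely so the example runs quickly:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=120, n_features=5, random_state=1)

# Step 0: hold out a test set that the tuning never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Step 1: parameter tuning via CV, on the training data only
gs = GridSearchCV(RandomForestRegressor(random_state=1),
                  param_grid={'n_estimators': [5, 10], 'max_depth': [2, 3]},
                  cv=3)
gs.fit(X_train, y_train)

# Step 2b: model assessment on the untouched test set
# (refit=True by default, so best_estimator_ is already refit on X_train)
y_pred = gs.best_estimator_.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(gs.best_params_, rmse)
```

Note that the test set is touched exactly once, after all tuning decisions have been made; this is what keeps the assessment free of the leakage discussed next.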
Nevertheless, practitioners have long wondered whether they can avoid "sacrificing" a part of their precious data purely for testing (model assessment) purposes, and whether they can actually skip the model assessment part (and the test set itself), using the best results obtained from the parameter tuning procedure (your step 1) as the model assessment. This is clearly cutting corners, but, as usual, the question is: how far off will the actual results be, and will they still be meaningful?
Again, in theory , what Vivek Kumar writes in his linked answer is correct:
If you use the whole data into GridSearchCV, then there would be leakage of test data into parameter tuning and then the final model may not perform that well on newer unseen data.
But the (highly recommended) Applied Predictive Modeling book (p. 78) has a relevant discussion of exactly this point.
In short: if you use the whole X in step 1 and consider the results of the tuning as model assessment, there will indeed be a bias/leakage, but it is usually small, at least for moderately large training sets.

Wrapping up: you can use your whole X in step 1, and most probably you will still be within acceptable limits regarding your model assessment.