I am using sklearn.feature_selection.RFECV:

ref = RFECV(lr, step=1, cv=5, scoring="r2")
ref.fit(X_ndarr, y_ndarr)
print(ref.grid_scores_)
I get:
[ 0.9316829   0.93472609  0.79440118 -2.37744438 -1.20559428 -1.35899883 -0.90087801 -1.02047363 -0.54169276 -0.08116821 -0.00685128  0.1561999  -0.26433411 -0.27843449 -0.32703359 -0.32782641 -0.30881354  0.11878835  0.08175137  0.04300757  0.0378917   0.04534877]
RFECV removes the least important feature at each step, so the score for, e.g., 10 features should be the best score achieved with any 10 features. But when I run the code below using 10 features selected another way:
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

lr = linear_model.LinearRegression()
scores = cross_val_score(lr, X_top10_ndarr, y_ndarr, cv=5) # top10 features
Then I get:
cross-validation scores: [0.96706997 0.9653103 0.96386666 0.96017565 0.96603127]
All of the scores are around 0.96, while the score with 10 features from RFECV is -0.08.
What exactly is happening here?
EDIT1: The number of selected features is 2, and the ranking_ is as follows:
[ 4 7 1 6 3 2 8 11 5 10 21 9 12 14 13 15 16 19 18 17 1 20]
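As a quick sanity check on the ranking above (a minimal sketch; the array is copied verbatim from this edit): the features whose ranking_ value is 1 are the selected ones, which is consistent with 2 selected features, since rank 1 appears twice.

```python
import numpy as np

# ranking_ as reported in the question; rank 1 marks a selected feature.
ranking = np.array([4, 7, 1, 6, 3, 2, 8, 11, 5, 10, 21,
                    9, 12, 14, 13, 15, 16, 19, 18, 17, 1, 20])
selected = np.flatnonzero(ranking == 1)  # indices of the selected features
print(selected)
```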
ref.grid_scores_ represents the cross-validation scores: grid_scores_[i] corresponds to the CV score of the i-th subset of features. With step=1, the subset sizes grow from 1 up to the full feature count, so grid_scores_[i] is the score obtained with i + 1 features.
Refer to this answer for more understanding of these values.
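To make the indexing concrete, here is a minimal sketch (the score values are copied from the question; the mapping assumes the default min_features_to_select=1):

```python
# grid_scores_[i] is the CV score obtained with (i + 1) features
# (step=1, min_features_to_select=1 assumed).
grid_scores = [0.9316829, 0.93472609, 0.79440118, -2.37744438,
               -1.20559428, -1.35899883, -0.90087801, -1.02047363,
               -0.54169276, -0.08116821]  # first 10 values from the question

for i, score in enumerate(grid_scores):
    print(f"{i + 1:2d} features -> CV score {score: .4f}")

# The entry for a 10-feature subset is grid_scores[9]:
print(grid_scores[9])
```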
Going by that explanation, the model's CV score for 10 features would be grid_scores_[9] = -0.08116821. That said, the score is really bad: it is negative, which suggests a linear model may simply not be a good fit for your data set.
One more point to note: even the best score across all subset sizes is only 0.93472609 (with 2 features), which is still less than 0.96.
Maybe set a random_state on the CV splitter (KFold here, since this is a regression task; StratifiedKFold is for classification) and pass that splitter as the cv parameter, so that both runs are scored on the same folds.
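A minimal sketch of that suggestion, on synthetic data (make_regression here is only a stand-in for the real X and y): seed one KFold splitter and pass it to both RFECV and cross_val_score, so both scores come from identical folds and are directly comparable.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data with the same shape as the question's (22 features).
X, y = make_regression(n_samples=100, n_features=22, n_informative=5,
                       random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)  # fixed, reusable splits
lr = LinearRegression()

selector = RFECV(lr, step=1, cv=cv, scoring="r2").fit(X, y)

# Score the RFECV-selected features on the very same folds.
scores = cross_val_score(lr, X[:, selector.support_], y, cv=cv, scoring="r2")
print(selector.n_features_, scores.mean())
```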