
RFECV in sklearn, scores from grid_scores_

I am using sklearn.feature_selection.RFECV:

from sklearn.feature_selection import RFECV

ref = RFECV(lr, step=1, cv=5, scoring="r2")
ref.fit(X_ndarr, y_ndarr)
print(ref.grid_scores_)

I get:

[ 0.9316829   0.93472609  0.79440118 -2.37744438 -1.20559428 -1.35899883
 -0.90087801 -1.02047363 -0.54169276 -0.08116821 -0.00685128  0.1561999
 -0.26433411 -0.27843449 -0.32703359 -0.32782641 -0.30881354  0.11878835
  0.08175137  0.04300757  0.0378917   0.04534877]

RFECV removes the least important feature at each step, so I would expect the score for, say, 10 features to be the best score achievable with any 10 features. However, when I run the code below on 10 features that I selected another way:

from sklearn import linear_model
from sklearn.model_selection import cross_val_score

lr = linear_model.LinearRegression()
scores = cross_val_score(lr, X_top10_ndarr, y_ndarr, cv=5)  # top-10 features

Then I get:

cross-validation scores: [0.96706997 0.9653103 0.96386666 0.96017565 0.96603127]

All of the scores are around 0.96, while the score with 10 features from RFECV is -0.08.

What exactly is happening here?

EDIT1: The number of selected features is 2 and the ranking_ is as follows:

[ 4  7  1  6  3  2  8 11  5 10 21  9 12 14 13 15 16 19 18 17  1 20]

ref.grid_scores_ holds the cross-validation scores, such that grid_scores_[i] corresponds to the CV score of the i-th subset of features.
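To make the indexing concrete, here is a minimal sketch that maps each entry of the array from the question to a feature count. It assumes step=1 and the default minimum of one selected feature, in which case scikit-learn stores the scores starting from the smallest subset, so index i corresponds to i + 1 features. The name score_by_n_features is only illustrative.

```python
# grid_scores_ values copied from the question (22 entries, one per subset size).
grid_scores = [
    0.9316829, 0.93472609, 0.79440118, -2.37744438, -1.20559428, -1.35899883,
    -0.90087801, -1.02047363, -0.54169276, -0.08116821, -0.00685128, 0.1561999,
    -0.26433411, -0.27843449, -0.32703359, -0.32782641, -0.30881354, 0.11878835,
    0.08175137, 0.04300757, 0.0378917, 0.04534877,
]

# Assumption: step=1 and at least one feature kept, so index i <-> i + 1 features.
score_by_n_features = {i + 1: score for i, score in enumerate(grid_scores)}

# Subset size with the highest mean CV score.
best_n = max(score_by_n_features, key=score_by_n_features.get)

print("best number of features:", best_n)
print("score with 10 features:", score_by_n_features[10])
print("score with all features:", score_by_n_features[len(grid_scores)])
```

Under that assumption the best subset size comes out as 2, which agrees with the number of selected features reported in EDIT1.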

Refer to this answer for more on how to read these values.

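Going by that explanation, and because grid_scores_ is ordered from the smallest feature subset upward when step=1, the model's CV score for 10 features would be grid_scores_[9] = -0.08116821, which is the value you read. This agrees with EDIT1: the best score, 0.93472609, sits at index 1, and 2 features were selected.

Having said that, the score is really bad. A negative R² means the model does worse on the held-out folds than always predicting the mean, so a linear model is probably not a good fit for that particular subset of features.

One more point to note is that even the best RFECV score, 0.93472609, is less than 0.96, and with all 22 features (the last entry) you get only 0.04534877.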
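On the negative scores: scoring="r2" reports the coefficient of determination, which is not bounded below by zero. The following is a small sketch of the textbook formula, R² = 1 - SS_res / SS_tot, written in plain Python rather than taken from scikit-learn; the helper name r2 and the toy numbers are made up for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

# Always predicting the mean of the targets gives exactly zero.
r2_mean = r2(y_true, [2.5, 2.5, 2.5, 2.5])

# Predictions that are worse than the mean give a negative value.
r2_bad = r2(y_true, [4.0, 3.0, 2.0, 1.0])

print(r2_mean, r2_bad)
```

So values such as -2.37 in the array above indicate subsets whose held-out predictions are far worse than a constant baseline, not a bug in the scorer.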

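It may also help to build the splitter explicitly, for example a KFold with shuffle=True and a fixed random_state (StratifiedKFold only accepts classification targets, so it would not work with a continuous y), and pass it as the cv parameter to both RFECV and cross_val_score so that the two runs are scored on identical folds.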
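A minimal sketch of that idea follows. It uses KFold rather than StratifiedKFold because the target is continuous, and it uses synthetic data from make_regression as a stand-in for X_ndarr and y_ndarr, which are not shown in the question. The variable names and the dataset parameters are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the question's data (assumption: 22 features, as in ranking_).
X, y = make_regression(
    n_samples=200, n_features=22, n_informative=5, noise=10.0, random_state=0
)

# One splitter with a fixed seed, shared by both evaluations.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
lr = LinearRegression()

# Recursive feature elimination scored with R^2 on the shared folds.
selector = RFECV(lr, step=1, cv=cv, scoring="r2")
selector.fit(X, y)

# Re-score the selected columns on the same folds for a like-for-like comparison.
X_selected = X[:, selector.support_]
scores = cross_val_score(lr, X_selected, y, cv=cv, scoring="r2")

print("selected features:", selector.n_features_)
print("mean R^2 on selected features:", scores.mean())
```

With the folds pinned this way, any remaining gap between the RFECV scores and a separate cross_val_score run comes from the feature subsets themselves rather than from how the data was split.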
