
Ranking and scores in Recursive Feature Elimination (RFE) in scikit-learn

I am trying to understand how to read the grid_scores_ and ranking_ values in RFECV. Here is the main example from the documentation:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=5)
selector = selector.fit(X, y)
selector.support_ 
array([ True,  True,  True,  True,  True,
        False, False, False, False, False], dtype=bool)

selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])

How am I supposed to read ranking_ and grid_scores_? Is a lower ranking value better (or vice versa)? The reason I ask is that I have noticed that the features with the highest ranking values typically have the highest scores in grid_scores_.

However, if a feature has ranking = 1, shouldn't that mean it was ranked as the best of the group? This is also what the documentation says:

"Selected (i.e., estimated best) features are assigned rank 1"
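As a sanity check of that statement, the rank-1 mask can be compared against support_ directly. This is a minimal sketch reusing the documentation example above; in scikit-learn the two attributes are defined so that they should agree:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
selector = RFECV(SVR(kernel="linear"), step=1, cv=5).fit(X, y)

# Rank 1 marks the selected (estimated best) features, so the
# boolean mask (ranking_ == 1) should match support_ exactly.
assert np.array_equal(selector.ranking_ == 1, selector.support_)
```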

But now let's look at the following example using some real data:

> rfecv.grid_scores_[np.nonzero(rfecv.ranking_ == 1)[0]]
0.0

while the feature with the highest ranking value has the highest *score*:

> rfecv.grid_scores_[np.argmax(rfecv.ranking_ )]
0.997

Note that in the example above, the features with ranking = 1 have the lowest score.

Figure in the documentation:

On this matter: in this figure in the documentation, the y-axis reads "number of misclassifications", but the figure plots grid_scores_, which was computed with 'accuracy' as the scoring function. Shouldn't the y label read "accuracy" (the higher the better) instead of "number of misclassifications" (the lower the better)?

You are correct that a low ranking value indicates a good feature and that a high cross-validation score in the grid_scores_ attribute is also good. However, you are misinterpreting what the values in grid_scores_ mean. From the RFECV documentation:

grid_scores_

array of shape [n_subsets_of_features]

The cross-validation scores such that grid_scores_[i] corresponds to the CV score of the i-th subset of features.

Thus the grid_scores_ values don't correspond to particular features; they are cross-validation scores for subsets of features. In the example, the subset with 5 features turns out to be the most informative, because the 5th value in grid_scores_ (the CV score for the SVR model using the 5 most highly ranked features) is the largest.

You should also note that since the scoring metric is not explicitly specified, the scorer used is the default for SVR, which is R^2, not accuracy (which is only meaningful for classifiers).
