
What exactly does `eli5.show_weights` display for a classification model?

I used eli5 to apply the permutation procedure for feature importance. The documentation gives some explanation and a small example, but it is not clear.

I am using a sklearn SVC model for a classification problem.

My question is: are these weights the change (decrease/increase) in accuracy when the specific feature is shuffled, OR are they the SVC weights of these features?

In this Medium article, the author states that these values show the reduction in model performance caused by reshuffling that feature, but I am not sure whether that is indeed the case.

Small example:

from sklearn import datasets
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.svm import SVC

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

clf = SVC(kernel='linear')
perms = PermutationImportance(clf, n_iter=1000, cv=10, scoring='accuracy').fit(X, y)

print(perms.feature_importances_)
print(perms.feature_importances_std_)

[0.38117333 0.16214   ]
[0.1349115  0.11182505]

eli5.show_weights(perms)

[screenshot of the eli5.show_weights output table]

I did some digging. After going through the source code, here is what I believe happens for the case where cv is used (i.e. it is not 'prefit' or None). I use a K-fold scheme for my application, and an SVC model, so the score is accuracy in this case.

Looking at the fit method of the PermutationImportance object, the _cv_scores_importances are computed (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/permutation_importance.py#L202). The specified cross-validation scheme is used, and the base_scores and feature_importances are computed on the test data of each fold (via the _get_score_importances function called inside _cv_scores_importances).
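To illustrate, here is a minimal sketch of that cv flow, assuming a plain K-fold split: fit the model on each training fold, then measure the score drop per feature on the held-out fold. The variable names are illustrative, not eli5's.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

rng = np.random.RandomState(0)
fold_decreases = []  # (fold, feature, score drop) triples
for fold, (train, test) in enumerate(
        KFold(n_splits=10, shuffle=True, random_state=0).split(X)):
    clf = SVC(kernel='linear').fit(X[train], y[train])
    base = clf.score(X[test], y[test])   # accuracy on the held-out fold
    for col in range(X.shape[1]):
        X_shuf = X[test].copy()
        rng.shuffle(X_shuf[:, col])      # permute one feature in the fold
        fold_decreases.append((fold, col, base - clf.score(X_shuf, y[test])))

print(len(fold_decreases))  # 10 folds x 2 features = 20 measurements
```

eli5 additionally repeats the shuffle n_iter times per fold, so the real procedure collects many such measurements per feature.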

Looking at the get_score_importances function (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/permutation_importance.py#L55), we can see that base_score is the score on the non-shuffled data, and the feature importances (called scores_decreases there) are defined as the non-shuffled score minus the shuffled score (see https://github.com/TeamHG-Memex/eli5/blob/master/eli5/permutation_importance.py#L93).
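The core definition above boils down to a few lines. This is a hedged sketch of the idea (not eli5's actual code), using the same iris setup as the question:

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

clf = SVC(kernel='linear').fit(X, y)
base_score = clf.score(X, y)          # accuracy on the unshuffled data

rng = np.random.RandomState(0)
decreases = []
for col in range(X.shape[1]):
    X_shuffled = X.copy()
    rng.shuffle(X_shuffled[:, col])   # permute one feature in place
    # "importance" = non-shuffled score minus shuffled score
    decreases.append(base_score - clf.score(X_shuffled, y))

print(decreases)
```

A positive value means shuffling that feature hurt accuracy, i.e. the model was relying on it.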

Finally, the errors (feature_importances_std_) are the standard deviation of the above score decreases (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/permutation_importance.py#L209), and feature_importances_ is their mean (non-shuffled score minus shuffled score).
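So for each feature, the reported numbers are just the mean and standard deviation over all collected score decreases. A tiny sketch with hypothetical values (eli5 collects n_iter decreases per cv fold):

```python
import numpy as np

# Hypothetical per-shuffle score decreases for a single feature
decreases = np.array([0.35, 0.42, 0.38, 0.30, 0.45])

importance = decreases.mean()      # what feature_importances_ stores
importance_std = decreases.std()   # what feature_importances_std_ stores
print(importance, importance_std)
```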

A much shorter answer to your original question: regardless of the cv setting, eli5 calculates the average decrease in the scorer you provide. Because you are using the sklearn wrapper, the scorer comes from scikit-learn: in your case, accuracy. As a general note on the package, some of these details are hard to figure out without digging into the source code; it might be worth submitting a pull request to make the documentation more detailed where possible.
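As a cross-check, scikit-learn itself ships a permutation_importance function (since version 0.22) that computes the same "mean drop in score" quantity, which you can compare against eli5's numbers:

```python
from sklearn import datasets
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
clf = SVC(kernel='linear').fit(X, y)

result = permutation_importance(clf, X, y, scoring='accuracy',
                                n_repeats=100, random_state=0)
print(result.importances_mean)  # analogous to perms.feature_importances_
print(result.importances_std)   # analogous to perms.feature_importances_std_
```

Note the values won't match eli5's exactly here, since this evaluates on the training data rather than cross-validated test folds.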
