
Getting feature importance by sample - Python Scikit Learn

I have a fitted model (clf) using sklearn.ensemble.RandomForestClassifier. I already know that I can get the global feature importances with clf.feature_importances_. However, I would like to know, if it's possible, how to get the feature importances for each individual sample.

Example:

from sklearn.ensemble import RandomForestClassifier

X = {"f1":[0,1,1,0,1], "f2":[1,1,1,0,1]}
y = [0,1,0,1,0]

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

y_pred = clf.predict(X)

Then, how do I get something like this:

y_pred  f1_importance  f2_importance
     1           0.57           0.43
     1           0.26           0.74
     1           0.31           0.69
     0           0.62           0.38
     1           0.16           0.84

* The y_pred and importance values above aren't real; I'm actually using pandas for the real project, in Python 3.8.

You can use the treeinterpreter package to get feature importances for the individual predictions of your RandomForestClassifier.

You can find treeinterpreter on GitHub and install it via

pip install treeinterpreter

I used your reference code but had to adjust it, because you cannot use a dictionary as input to fit a RandomForestClassifier:

from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti
import numpy as np
import pandas as pd

# the question's data as an array; the columns are f1 and f2
X = np.array([[0, 1], [1, 1], [1, 1], [0, 0], [1, 1]])
y = [0, 1, 0, 1, 0]

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

y_pred = clf.predict(X)
y_pred_probas = clf.predict_proba(X)

Then I used treeinterpreter with your classifier and data to compute the bias, the contributions, and the prediction values:

# prediction: predicted probabilities, shape (n_samples, n_classes)
# bias: the training-set mean per class, shape (n_samples, n_classes)
# contributions: per-feature, per-class contributions,
#                shape (n_samples, n_features, n_classes)
prediction, bias, contributions = ti.predict(clf, X)

# Summing the contributions over the feature axis gives the total
# contribution towards each class (not a per-feature value), so the
# columns are labelled per class here.
totals = np.sum(contributions, axis=1)  # shape (n_samples, n_classes)

df = pd.DataFrame({
    "Prediction": y_pred,
    "Prediction value 0": prediction[:, 0],
    "Prediction value 1": prediction[:, 1],
    "class0_contribution": totals[:, 0],
    "class0_bias": bias[:, 0],
    "class1_contribution": totals[:, 1],
    "class1_bias": bias[:, 1],
})

df

Output

(screenshot of the resulting DataFrame)
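The contributions array is what actually answers the per-sample question: contributions[:, j, c] is feature j's contribution to class c for each sample. Here is a minimal sketch of how you could build the table from the question; note that normalizing the absolute contributions so that each row sums to 1 is my own assumption about what a per-sample "importance" should mean, not something treeinterpreter defines, and per_sample/importance are just names I made up:

# per-feature contributions towards class 1, one row per sample
contrib_class1 = contributions[:, :, 1]          # shape (n_samples, n_features)

# Normalizing by the row-wise absolute sum is an assumption on my part;
# treeinterpreter itself only returns the raw additive contributions.
abs_sums = np.abs(contrib_class1).sum(axis=1, keepdims=True)
importance = np.abs(contrib_class1) / abs_sums   # each row sums to 1

per_sample = pd.DataFrame({
    "y_pred": y_pred,
    "f1_importance": importance[:, 0],
    "f2_importance": importance[:, 1],
})
print(per_sample)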

You can have a look at this blog post by the author to better understand how it works.

In the table, the prediction values for class 0 and class 1 refer to the probabilities of both classes, which you can also compute using the existing predict_proba() method of RandomForestClassifier.
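For instance, this quick sanity check should print True, since it compares treeinterpreter's prediction with the probabilities computed earlier:

# ti.predict's prediction should match predict_proba row for row
print(np.allclose(prediction, y_pred_probas))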

You can verify that the bias and contributions add up to the prediction value/probability like this:

bias + np.sum(contributions, axis=1)

Output

array([[0.744 , 0.256 ],
       [0.6565, 0.3435],
       [0.6565, 0.3435],
       [0.214 , 0.786 ],
       [0.6565, 0.3435]])
