
Getting feature importance by sample - Python Scikit Learn

I have a fitted model (clf) using sklearn.ensemble.RandomForestClassifier. I already know that I can get the global feature importances with clf.feature_importances_. However, I would like to know, if it's possible, how to get the feature importances for each individual sample.

Example:

from sklearn.ensemble import RandomForestClassifier

X = {"f1":[0,1,1,0,1], "f2":[1,1,1,0,1]}
y = [0,1,0,1,0]

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

y_pred = clf.predict(X)

Then, how do I get something like this:

y_pred  f1_importance  f2_importance
     1           0.57           0.43
     1           0.26           0.74
     1           0.31           0.69
     0           0.62           0.38
     1           0.16           0.84

* The y_pred and importance values above aren't real; I'm actually using pandas for the real project, in Python 3.8.

You can use the treeinterpreter package to get feature importances for the individual predictions of your RandomForestClassifier.

You can find treeinterpreter on GitHub and install it via

pip install treeinterpreter

I used your reference code but had to adjust it, because you cannot use a dictionary as input to fit a RandomForestClassifier:

from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti
import numpy as np
import pandas as pd

# the question's data as an array; the columns are f1 and f2
X = np.array([[0, 1], [1, 1], [1, 1], [0, 0], [1, 1]])
y = [0, 1, 0, 1, 0]

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

y_pred = clf.predict(X)
y_pred_probas = clf.predict_proba(X)

Then I used treeinterpreter with your classifier and data to compute the bias, the contributions, and the prediction values:

# prediction: predicted probabilities, shape (n_samples, n_classes)
# bias: the training-set mean per class, shape (n_samples, n_classes)
# contributions: per-feature, per-class contributions,
#                shape (n_samples, n_features, n_classes)
prediction, bias, contributions = ti.predict(clf, X)

# Summing the contributions over the feature axis gives the total
# contribution towards each class (not a per-feature value), so the
# columns are labelled per class here.
totals = np.sum(contributions, axis=1)  # shape (n_samples, n_classes)

df = pd.DataFrame({
    "Prediction": y_pred,
    "Prediction value 0": prediction[:, 0],
    "Prediction value 1": prediction[:, 1],
    "class0_contribution": totals[:, 0],
    "class0_bias": bias[:, 0],
    "class1_contribution": totals[:, 1],
    "class1_bias": bias[:, 1],
})

df

Output

(screenshot of the resulting DataFrame)
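The contributions array is what actually answers the per-sample question: contributions[:, j, c] is feature j's contribution to class c for each sample. Here is a minimal sketch of how you could build the table from the question; note that normalizing the absolute contributions so that each row sums to 1 is my own assumption about what a per-sample "importance" should mean, not something treeinterpreter defines, and per_sample/importance are just names I made up:

# per-feature contributions towards class 1, one row per sample
contrib_class1 = contributions[:, :, 1]          # shape (n_samples, n_features)

# Normalizing by the row-wise absolute sum is an assumption on my part;
# treeinterpreter itself only returns the raw additive contributions.
abs_sums = np.abs(contrib_class1).sum(axis=1, keepdims=True)
importance = np.abs(contrib_class1) / abs_sums   # each row sums to 1

per_sample = pd.DataFrame({
    "y_pred": y_pred,
    "f1_importance": importance[:, 0],
    "f2_importance": importance[:, 1],
})
print(per_sample)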

You can have a look at this blog post by the author to better understand how it works.

In the table, the prediction values for class 0 and class 1 refer to the probabilities of both classes, which you can also compute using the existing predict_proba() method of RandomForestClassifier.
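For instance, this quick sanity check should print True, since it compares treeinterpreter's prediction with the probabilities computed earlier:

# ti.predict's prediction should match predict_proba row for row
print(np.allclose(prediction, y_pred_probas))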

You can verify that the bias and contributions add up to the prediction value/probability like this:

bias + np.sum(contributions, axis=1)

Output

array([[0.744 , 0.256 ],
       [0.6565, 0.3435],
       [0.6565, 0.3435],
       [0.214 , 0.786 ],
       [0.6565, 0.3435]])
