I have a fitted model (clf) using sklearn.ensemble.RandomForestClassifier. I already know that I can get the overall feature importances with clf.feature_importances_. What I would like to know is whether it's possible to get the feature importances for each individual sample.
Example:
from sklearn.ensemble import RandomForestClassifier
X = {"f1":[0,1,1,0,1], "f2":[1,1,1,0,1]}
y = [0,1,0,1,0]
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
y_pred = clf.predict(X)
Then, how do I get something like this:
y_pred  f1_importance  f2_importance
1       0.57           0.43
1       0.26           0.74
1       0.31           0.69
0       0.62           0.38
1       0.16           0.84
* The y_pred values aren't real. I'm actually using pandas for the real project, in Python 3.8.
You can use treeinterpreter to get the feature importances for individual predictions of your RandomForestClassifier. You can find treeinterpreter on GitHub and install it via
pip install treeinterpreter
I used your reference code but had to adjust it, because you cannot use a dictionary as input to fit a RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti
import numpy as np
import pandas as pd
X = np.array([[0,1],[1,1],[1,1],[0,0],[1,1]])
y = [0,1,0,1,0]
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
y_pred = clf.predict(X)
y_pred_probas = clf.predict_proba(X)
Then I used treeinterpreter with your classifier and data to compute the bias, the contributions, and the prediction values:
prediction, bias, contributions = ti.predict(clf, X)
# contributions has shape (n_samples, n_features, n_classes):
# contributions[i, j, c] is the contribution of feature j to class c for sample i
df = pd.DataFrame({
    "Prediction": y_pred,
    "Prediction value 0": prediction[:, 0],
    "Prediction value 1": prediction[:, 1],
    "f1_contribution": contributions[:, 0, 1],  # f1's contribution towards class 1
    "f2_contribution": contributions[:, 1, 1],  # f2's contribution towards class 1
    "bias": bias[:, 1],                         # base rate (prior) for class 1
})
df
Output
You can have a look at this blog post by the author to better understand how it works.
In the table, "Prediction value 0" and "Prediction value 1" refer to the probabilities of the two classes, which you can also compute with the existing predict_proba() method of RandomForestClassifier.
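For comparison, here is a self-contained sketch of that probability check using predict_proba alone (same toy data as above, no treeinterpreter needed):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0, 1], [1, 1], [1, 1], [0, 0], [1, 1]])
y = [0, 1, 0, 1, 0]

clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X, y)

# one row per sample, one column per class
proba = clf.predict_proba(X)
print(proba)

# every row is a probability distribution over the two classes
assert np.allclose(proba.sum(axis=1), 1.0)
```

The prediction array returned by ti.predict should match this output, which is a handy sanity check after installing the package.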
You can verify that the bias and contributions add up to the prediction value/probability like this:
bias + np.sum(contributions, axis=1)
Output
array([[0.744 , 0.256 ],
[0.6565, 0.3435],
[0.6565, 0.3435],
[0.214 , 0.786 ],
[0.6565, 0.3435]])
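If you want to see what treeinterpreter is doing under the hood, here is a minimal sketch of the same decomposition for a single DecisionTreeClassifier, using only scikit-learn (variable names are my own): walk each sample's decision path and credit the change in the node's class distribution at every split to the feature that was split on. For a forest, treeinterpreter averages this decomposition over all trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 1], [1, 1], [1, 1], [0, 0], [1, 1]])
y = np.array([0, 1, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
t = tree.tree_

# per-node class distribution, normalised so this works whether tree_.value
# stores raw counts or fractions (this differs between sklearn versions)
node_value = t.value[:, 0, :]
node_value = node_value / node_value.sum(axis=1, keepdims=True)

n_samples, n_features = X.shape
n_classes = node_value.shape[1]
bias = np.tile(node_value[0], (n_samples, 1))           # root distribution
contribs = np.zeros((n_samples, n_features, n_classes))

paths = tree.decision_path(X)                           # sparse CSR matrix
for i in range(n_samples):
    # node ids visited by sample i, in root-to-leaf order
    path = paths.indices[paths.indptr[i]:paths.indptr[i + 1]]
    for parent, child in zip(path[:-1], path[1:]):
        feat = t.feature[parent]                        # feature split on at parent
        contribs[i, feat] += node_value[child] - node_value[parent]

# the decomposition reconstructs the tree's own probabilities exactly
assert np.allclose(bias + contribs.sum(axis=1), tree.predict_proba(X))
```

The telescoping sum along the path guarantees that bias + contributions equals the leaf's class distribution, which is exactly the identity verified above for the forest.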