Getting feature importance by sample - Python Scikit Learn
I have a fitted model (clf) using sklearn.ensemble.RandomForestClassifier. I already know that I can get the overall feature importances with clf.feature_importances_. What I would like to know, if it's possible, is how to get the feature importances for each individual sample.
Example:
from sklearn.ensemble import RandomForestClassifier
X = {"f1":[0,1,1,0,1], "f2":[1,1,1,0,1]}
y = [0,1,0,1,0]
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
y_pred = clf.predict(X)
Then, how do I get something like this:
y_pred f1_importance f2_importance
1 0.57 0.43
1 0.26 0.74
1 0.31 0.69
0 0.62 0.38
1 0.16 0.84
* y_pred values aren't real.
I'm actually using pandas for the real project in Python 3.8.
You can use treeinterpreter to get the feature importances for individual predictions of your RandomForestClassifier.
You can find treeinterpreter on GitHub and install it via
pip install treeinterpreter
I used your reference code but had to adjust it, because you cannot use a dictionary as input to fit your RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti
import numpy as np
import pandas as pd
X = np.array([[0,1],[1,1],[1,1],[0,0],[1,1]])
y = [0,1,0,1,0]
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
y_pred = clf.predict(X)
y_pred_probas = clf.predict_proba(X)
Then I used treeinterpreter with your classifier and data to compute the bias, contributions and also the prediction values:
prediction, bias, contributions = ti.predict(clf, X)
# Sum the per-feature contributions for each sample (axis=1 sums over features),
# giving one contribution total per class; bias is the per-class baseline.
contrib = np.sum(contributions, axis=1)
df = pd.DataFrame({
    "Prediction": y_pred,
    "Prediction value 0": prediction[:, 0],
    "Prediction value 1": prediction[:, 1],
    "f1_contribution": contrib[:, 0],
    "f1_bias": bias[:, 0],
    "f2_contribution": contrib[:, 1],
    "f2_bias": bias[:, 1],
})
df
Output
You can have a look at this blog post by the author to understand better how it works.
In the table, the prediction values for 0 and 1 refer to the probabilities for both classes, which you can also compute by using the existing predict_proba() method of RandomForestClassifier.
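As a quick check of that claim, here is a minimal sketch (using only scikit-learn, same data as above) that computes those class probabilities directly with predict_proba; each row holds one probability per class and sums to 1:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same data and classifier settings as in the answer above.
X = np.array([[0, 1], [1, 1], [1, 1], [0, 0], [1, 1]])
y = [0, 1, 0, 1, 0]

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Each row holds [P(class 0), P(class 1)] for one sample.
probas = clf.predict_proba(X)
print(probas.shape)  # (5, 2)
```

These are the same numbers that appear as the prediction values in the treeinterpreter output.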
You can verify that the bias and contributions add up to the prediction value/probability like this:
bias + np.sum(contributions, axis=1)
Output
array([[0.744 , 0.256 ],
[0.6565, 0.3435],
[0.6565, 0.3435],
[0.214 , 0.786 ],
[0.6565, 0.3435]])
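If you want a table exactly like the one in the question (per-sample importance shares that sum to 1 per row), one possible approach, and this is a sketch of my own rather than part of treeinterpreter, is to take each sample's contributions toward its predicted class and normalize their absolute values:

```python
import numpy as np

# Example contributions array shaped (n_samples, n_features, n_classes),
# the shape returned by ti.predict; the values here are made up.
contributions = np.array([
    [[ 0.10, -0.10], [-0.05,  0.05]],
    [[-0.20,  0.20], [ 0.15, -0.15]],
])
pred_class = np.array([0, 1])  # predicted class per sample, e.g. from clf.predict

# Select each sample's per-feature contributions toward its predicted class,
# then normalize absolute values so each row sums to 1.
per_sample = contributions[np.arange(len(pred_class)), :, pred_class]
shares = np.abs(per_sample) / np.abs(per_sample).sum(axis=1, keepdims=True)
print(shares)
```

Note that taking absolute values discards the sign, so a share only tells you how strongly a feature moved the prediction, not in which direction.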