
Determine WHY Features Are Important in Decision Tree Models

Oftentimes stakeholders don't want a black-box model that's good at predicting; they want insights about features so they can better understand their business and explain it to others.

When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?

Is there a way to explain not only what features are important but also WHY they're important?

I was told to use shap, but running even some of the boilerplate examples throws errors, so I'm looking for alternatives (or even just a procedural way to inspect trees and glean insights I can take away, other than a plot_importance() plot).

In the example below, how does one go about explaining WHY feature f19 is the most important (while also realizing that decision trees are random without a random_state or seed)?

from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
X,y = make_classification(random_state=68)
xgb = XGBClassifier()
xgb.fit(X, y)
plot_importance(xgb)
plt.show()

[feature_importance plot from plot_importance()]

Update: What I'm looking for is a programmatic, procedural proof that the features chosen by the model above contribute either positively or negatively to the predictive power. I want to see code (not theory) of how you would go about inspecting the actual model and determining each feature's positive or negative contribution. Currently, I maintain that it's not possible, so somebody please prove me wrong. I'd love to be wrong!

I also understand that decision trees are non-parametric and have no coefficients. Still, is there a way to see whether a feature contributes positively (one unit of this feature increases y) or negatively (one unit of this feature decreases y)?

Update 2: Despite a thumbs-down on this question, and several "close" votes, it seems this question isn't so crazy after all. Partial dependence plots might be the answer.

Partial Dependence Plots (PDP) were introduced by Friedman (2001) with the purpose of interpreting complex machine learning algorithms. Interpreting a linear regression model is not as complicated as interpreting Support Vector Machine, Random Forest, or Gradient Boosting Machine models; this is where Partial Dependence Plots come into use. For some statistical explanation you can refer here and to more advanced material. Some of the algorithms have methods for finding variable importance, but they do not express whether a variable affects the model positively or negatively.
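As a rough sketch of how this applies to the model in the question (assuming the fitted xgb, X and y from the code above, and a reasonably recent scikit-learn where sklearn.inspection.PartialDependenceDisplay exists; the xgboost sklearn wrapper usually passes scikit-learn's estimator checks, but this may need adjusting depending on your versions):

from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# Feature index 19 corresponds to xgboost's default name "f19".
# The curve shows how the average predicted probability of the positive class
# moves as f19 is swept over its range, i.e. whether the feature's
# contribution is positive or negative over that range.
PartialDependenceDisplay.from_estimator(xgb, X, features=[19])
plt.show()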

The "importance" of a feature depends on the algorithm you are using to build the trees. 功能的“重要性”取决于您用于构建树的算法。 In C4.5 trees, for example, a maximum-entropy criterion is often used. 例如,在C4.5树中,经常使用最大熵标准。 This means that the feature set is the one that allows classification with the fewer decision steps. 这意味着功能集允许使用较少的决策步骤进行分类。

When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?

Yes, we do. Feature importance is not some magical object; it is a well-defined mathematical criterion - its exact definition depends on the particular model (and/or some additional choices), but it is always an object which tells "why". The "why" is usually the most basic thing possible, and boils down to "because it has the strongest predictive power". For example, for random forests, feature importance is a measure of how probable it is for a feature to be used on a decision path when a randomly selected training data point is pushed through the tree. So it gives "why" in a proper, mathematical sense.
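The same holds for xgboost itself: the number it reports is one of several precisely defined criteria, and you can ask the booster for each of them. A minimal sketch, assuming the fitted xgb from the question's example:

booster = xgb.get_booster()

# Each importance_type is a concrete, well-defined criterion:
#   "weight" - how many times the feature is used to split, across all trees
#   "gain"   - average loss reduction obtained when splitting on the feature
#   "cover"  - average number of samples affected by splits on the feature
for importance_type in ("weight", "gain", "cover"):
    scores = booster.get_score(importance_type=importance_type)
    print(importance_type, sorted(scores.items(), key=lambda kv: -kv[1])[:5])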

tl;dr: http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html


I'd like to clear up some of the wording to make sure we're on the same page.

  1. Predictive power: what features significantly contribute to the prediction
  2. Feature dependence: are the features positively or negatively correlated, i.e., does a change in feature X cause the prediction y to increase/decrease

1. Predictive power

Your feature importance shows you what retains the most information and which are the most significant features. Power could imply what causes the biggest change - you would have to check by plugging in dummy values to see their overall impact, much like you would with linear regression coefficients.

2. Correlation/Dependence

As pointed out by @Tiago1984, it depends heavily on the underlying algorithm. XGBoost/GBM additively build a committee of stubs (shallow decision trees, usually with only one split).

In a regression problem, the trees typically use a criterion related to the MSE. I won't go into the full details, but you can read more here: https://medium.com/towards-data-science/boosting-algorithm-gbm-97737c63daa3.

You'll see that at each step it calculates a vector for the "direction" of the weak learner, so in principle you know the direction of its influence (but keep in mind a feature may appear many times in one tree, and in multiple steps of the additive model).
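As a toy sketch of that mechanism (not xgboost's internals; just squared-error boosting on synthetic data to make the "direction" concrete), each step fits a weak learner to the negative gradient, i.e. the residuals, so the sign of its leaf values tells you which way that step pushes the prediction:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

X_reg, y_reg = make_regression(n_features=5, random_state=0)
prediction = np.full_like(y_reg, y_reg.mean())   # start from the mean prediction

residuals = y_reg - prediction                   # negative gradient of the MSE loss
stub = DecisionTreeRegressor(max_depth=1).fit(X_reg, residuals)
prediction += 0.1 * stub.predict(X_reg)          # shrunken additive update

# The chosen split feature and the signs of the two leaf values show which way
# this weak learner moves the prediction on each side of the split.
print("split feature:", stub.tree_.feature[0])
print("leaf values:", stub.tree_.value.ravel()[1:])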

But, to cut to the chase: you could fix all your features apart from f19, make predictions for a range of f19 values, and see how it is related to the response value.
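A hand-rolled version of that idea, again assuming the xgb and X from the question's code (column 19 is what xgboost calls f19), might look like this:

import numpy as np
import matplotlib.pyplot as plt

grid = np.linspace(X[:, 19].min(), X[:, 19].max(), 50)
avg_pred = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, 19] = value                         # overwrite f19 everywhere with one grid value
    avg_pred.append(xgb.predict_proba(X_mod)[:, 1].mean())

# If the curve rises with f19, the feature pushes predictions towards class 1
# over that range; if it falls, its contribution is negative there.
plt.plot(grid, avg_pred)
plt.xlabel("f19")
plt.ylabel("average predicted P(y=1)")
plt.show()

This is essentially what the partial dependence plot linked below computes for you.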

Take a look at partial dependence plots: http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html

There's also a chapter on it in The Elements of Statistical Learning, Section 10.13.2.
