
Determine WHY Features Are Important in Decision Tree Models

Oftentimes stakeholders don't want a black-box model that's good at predicting; they want insights about the features so they have a better understanding of their business and can explain it to others.

When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?

Is there a way to explain not only what features are important but also WHY they're important?

I was told to use shap, but even running some of the boilerplate examples throws errors, so I'm looking for alternatives (or even just a procedural way to inspect the trees and glean insights I can take away, other than a plot_importance() plot).

In the example below, how does one go about explaining WHY feature f19 is the most important (while also realizing that the trees are built with some randomness unless a random_state or seed is set)?

from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
X, y = make_classification(random_state=68)
xgb = XGBClassifier()
xgb.fit(X, y)
plot_importance(xgb)
plt.show()

[Image: feature importance plot from plot_importance, with f19 ranked highest]

Update: What I'm looking for is a programmatic procedural proof that the features chosen by the model above contribute either positively or negatively to the predictive power. I want to see code (not theory) of how you would go about inspecting the actual model and determining each feature's positive or negative contribution. Currently, I maintain that it's not possible so somebody please prove me wrong. I'd love to be wrong!

I also understand that decision trees are non-parametric and have no coefficients. Still, is there a way to see whether a feature contributes positively (one unit of this feature increases y) or negatively (one unit of this feature decreases y)?

Update2: Despite a thumbs down on this question, and several "close" votes, it seems this question isn't so crazy after all. Partial dependence plots might be the answer.

Partial Dependence Plots (PDPs) were introduced by Friedman (2001) for the purpose of interpreting complex machine learning models. Interpreting a linear regression model is not as complicated as interpreting a Support Vector Machine, Random Forest or Gradient Boosting Machine model, and this is where Partial Dependence Plots come into use. For some statistical explanation you can refer here and to more advanced material. Some algorithms have methods for finding variable importance, but they do not express whether a variable affects the model positively or negatively.
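For instance, here is a minimal sketch of a partial dependence plot for the model in the question, assuming scikit-learn >= 1.0 (where PartialDependenceDisplay replaced the older plot_partial_dependence) and assuming the xgboost sklearn wrapper cooperates with sklearn's inspection tools in your versions:

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

X, y = make_classification(random_state=68)
xgb = XGBClassifier().fit(X, y)

# Partial dependence of the predicted probability on feature index 19 (f19):
# an upward-sloping curve means larger f19 values push predictions towards class 1,
# a downward slope means the opposite.
PartialDependenceDisplay.from_estimator(xgb, X, features=[19])
plt.show()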

The "importance" of a feature depends on the algorithm you are using to build the trees. In C4.5 trees, for example, a maximum-entropy criterion is often used. This means that the feature set is the one that allows classification with the fewer decision steps.

When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?

Yes we do. Feature importance is not some magical object; it is a well-defined mathematical criterion - its exact definition depends on the particular model (and/or some additional choices), but it is always an object that tells you "why". The "why" is usually the most basic thing possible and boils down to "because it has the strongest predictive power". For example, for a random forest, feature importance is a measure of how probable it is for the feature to be used on a decision path when a randomly selected training data point is pushed through the tree. So it gives a "why" in a proper, mathematical sense.
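Since the question's model is xgboost rather than a random forest, here is a minimal sketch of the concrete definitions xgboost itself exposes; each importance_type is a different mathematical "why":

from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=68)
xgb = XGBClassifier().fit(X, y)
booster = xgb.get_booster()   # numpy input, so features are named f0..f19

# weight: how many times a feature is used to split
# gain:   average loss reduction achieved when splitting on the feature
# cover:  average number of samples affected by those splits
for importance_type in ("weight", "gain", "cover"):
    print(importance_type, booster.get_score(importance_type=importance_type))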

tldr; http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html


I'd like to clear up some of the wording to make sure we're on the same page.

  1. Predictive power: which features significantly contribute to the prediction
  2. Feature dependence: are the features positively or negatively correlated, i.e., does a change in feature X cause the prediction y to increase/decrease

1. Predictive power

Your feature importance shows which features retain the most information and are the most significant. Power could imply what causes the biggest change; you would have to check by plugging in dummy values and seeing their overall impact, much like you would with linear regression coefficients.
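One way to check this programmatically is permutation importance: shuffle each feature in turn and measure how much the score drops. A minimal sketch, assuming scikit-learn >= 0.22 for permutation_importance (ideally you would run it on held-out data rather than the training set):

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance

X, y = make_classification(random_state=68)
xgb = XGBClassifier().fit(X, y)

# A large drop in accuracy when a feature is shuffled means the model
# genuinely relies on that feature for its predictions.
result = permutation_importance(xgb, X, y, scoring="accuracy", n_repeats=10, random_state=68)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"f{idx}: {result.importances_mean[idx]:.4f} +/- {result.importances_std[idx]:.4f}")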

2. Correlation/Dependence

As pointed out by @Tiago1984, it depends heavily on the underlying algorithm. XGBoost/GBM additively build a committee of stumps (shallow decision trees, usually with only one split).

In a regression problem, the trees typically use a criterion related to the MSE. I won't go into the full details, but you can read more here: https://medium.com/towards-data-science/boosting-algorithm-gbm-97737c63daa3 .

You'll see that at each step it calculates a vector for the "direction" of the weak learner, so in principle you know the direction of its influence (but keep in mind a feature may appear many times in one tree and in multiple steps of the additive model).
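If you want to inspect the fitted trees directly, here is a sketch assuming a reasonably recent xgboost whose Booster.trees_to_dataframe is available (it needs pandas installed): you can look at which features each split uses and at the leaf values, whose sign tells you which direction a given path pushes the raw (log-odds) prediction.

from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=68)
xgb = XGBClassifier().fit(X, y)

trees = xgb.get_booster().trees_to_dataframe()

# Split nodes: which feature is tested, at what threshold, and with what gain.
print(trees[trees["Feature"] == "f19"][["Tree", "Node", "Split", "Gain"]].head())
# Leaf nodes: trees_to_dataframe reports the leaf output in the Gain column;
# positive values push the prediction towards class 1, negative towards class 0.
print(trees[trees["Feature"] == "Leaf"][["Tree", "Node", "Gain"]].head())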

But, to cut to the chase: you could just fix all of your features apart from f19, make predictions for a range of f19 values, and see how the predicted response changes.
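A minimal sketch of that check for the question's model, holding every other feature at its median (which is only one of many possible baselines):

import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=68)
xgb = XGBClassifier().fit(X, y)

# Sweep f19 (column 19) across its observed range while fixing the other features.
grid = np.linspace(X[:, 19].min(), X[:, 19].max(), 50)
X_sweep = np.tile(np.median(X, axis=0), (len(grid), 1))
X_sweep[:, 19] = grid

probs = xgb.predict_proba(X_sweep)[:, 1]
# If probs rises with grid, f19 contributes "positively" at this baseline;
# if it falls, negatively. Averaging over many baselines gives the partial dependence plot.
for v, p in zip(grid[::10], probs[::10]):
    print(f"f19={v:+.2f} -> P(y=1)={p:.3f}")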

Take a look at partial dependence plots: http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html

There's also a section on them in The Elements of Statistical Learning, Section 10.13.2.
