
rpart variable importance shows more variables than decision tree plots

I fitted an rpart model with leave-one-out cross-validation on my data, using the caret library in R. Everything works fine, but I want to understand the difference between the model's variable importance and the decision tree plot.

Calling varImp() on the model shows nine variables, while plotting the decision tree with functions such as fancyRpartPlot() or rpart.plot() shows a tree that uses only two variables to classify all subjects.
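To make the setup concrete, here is a minimal sketch of the workflow (iris is used below as a stand-in for my data):

    library(caret)       # train(), varImp()
    library(rpart.plot)  # rpart.plot()

    set.seed(1)
    fit <- train(Species ~ ., data = iris,
                 method = "rpart",
                 trControl = trainControl(method = "LOOCV"))

    varImp(fit)                 # the importance table may list many variables...
    rpart.plot(fit$finalModel)  # ...while the plotted tree splits on only a few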

How can that be? Why does the decision tree plot not show the same nine variables that appear in the variable importance table?

Thank you.

[Image: variable importance table from varImp()]

[Image: decision tree plot]

Similar to rpart(), caret has a useful property: it accounts for surrogate variables, i.e. variables that are not chosen for a split but that came close to winning the competition.

Let me be clearer. Say that, at a given split, the algorithm decided to split on x1. Suppose there is another variable, say x2, that would have been almost as good as x1 for splitting at that stage. We call x2 a surrogate, and we credit it with variable importance just as we do for x1.

This is why the importance ranking can contain variables that are never actually used for splitting. You may even find that such variables come out as more important than some that are actually used!
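You can inspect this bookkeeping directly. A small sketch, again with iris standing in for the data: summary() on a fitted rpart object prints, for each node, the chosen split together with the near-winning splits (listed under "Primary splits:") and the surrogate splits.

    library(rpart)

    fit <- rpart(Species ~ ., data = iris)

    # For each node: the chosen split plus the splits that nearly won
    # (under "Primary splits:"), followed by "Surrogate splits:"
    summary(fit)

    # rpart's own importance measure, which also credits surrogates
    fit$variable.importance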

The rationale for this is explained in the documentation for rpart(): suppose we have two identical covariates, say x3 and x4. Then rpart() will likely split on only one of them, e.g. x3. How could we then say that x4 is not important?
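Here is a quick sketch of that rationale: duplicate one covariate (the name dup is mine, purely for illustration) and note that the copy still earns importance even though it never appears in the tree.

    library(rpart)

    d <- iris
    d$dup <- d$Petal.Width   # exact copy of an existing covariate

    fit <- rpart(Species ~ ., data = d)

    unique(fit$frame$var)    # the variables actually used at the splits
    fit$variable.importance  # the copy still scores: it is a perfect
                             # surrogate for the variable it duplicates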

To conclude, variable importance accounts for the increase in fit from both primary variables (i.e. the ones actually chosen for splitting) and surrogate variables. So the importance of x1 counts both the splits where x1 is chosen as the splitting variable and the splits where another variable is chosen but x1 is a close competitor.

Hope this clarifies your doubts. For more details, see here. Just a quick quotation:

The following methods for estimating the contribution of each variable to the model are available [speaking of how variable importance is computed]:

[...]

- Recursive Partitioning: The reduction in the loss function (eg mean squared error) attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. This can be turned off using the maxcompete argument in rpart.control.

I am not very familiar with caret, but from this quote it appears that the package actually uses rpart() to grow its trees, thus inheriting this behaviour for surrogate variables.
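Building on the quoted documentation, here is a hedged sketch of turning that tabulation off. I fit rpart() directly with maxcompete = 0 (caret's varImp() also works on plain rpart objects); per caret's documentation, its varImp() method for rpart also has surrogates and competes arguments, so varImp(fit, competes = FALSE) should have a similar effect on an existing fit.

    library(caret)
    library(rpart)

    # Disable competitor bookkeeping at fitting time, per the quote above
    fit0 <- rpart(Species ~ ., data = iris,
                  control = rpart.control(maxcompete = 0))

    # With no competitors recorded, the importance table should now be
    # limited to variables that actually appear in the tree's splits
    varImp(fit0)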
