
rpart variable importance shows more variables than decision tree plots

I fitted an rpart model with leave-one-out cross-validation on my data using the caret library in R. Everything is fine, but I want to understand the difference between the model's variable importance and the decision tree plot.

Calling the variable importance with the function varImp() shows nine variables. Plotting the decision tree with functions such as fancyRpartPlot() or rpart.plot() shows a tree that uses only two variables to classify all subjects.
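For reference, a minimal sketch of that setup (the asker's data is not available, so the built-in iris data stands in for it):

```r
library(caret)       # train(), varImp()
library(rpart.plot)  # rpart.plot()

# Leave-one-out cross-validation with rpart as the learner,
# illustrated on the built-in iris data
ctrl <- trainControl(method = "LOOCV")
fit  <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

varImp(fit)                 # importance table: can list all four predictors
rpart.plot(fit$finalModel)  # the plotted tree typically splits on fewer
```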

How can that be? Why does the decision tree plot not show the same nine variables that appear in the variable importance table?

Thank you.

[Figure: varImp() plot]

Similar to rpart(), caret has a nice property: it deals with surrogate variables, i.e. variables that are not chosen for splits but that came close to winning the competition.

Let me be more precise. Say that at a given split the algorithm decided to split on x1. Suppose there is another variable, say x2, that would have been almost as good as x1 for splitting at that stage. We call x2 a surrogate, and we assign it variable importance just as we do for x1.

This is why variables that are never actually used for splitting can appear in the importance ranking. You may even find that such variables rank as more important than variables that are actually used!

The rationale for this is explained in the documentation for rpart(): suppose we have two identical covariates, say x3 and x4. Then rpart() will likely split on only one of them, e.g. x3. How could we then say that x4 is not important?

To conclude, variable importance considers the increase in fit for both primary variables (i.e. the ones actually chosen for splitting) and surrogate variables. So the importance of x1 accounts both for splits in which x1 is chosen as the splitting variable and for splits in which another variable is chosen but x1 is a close competitor.
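A toy calculation with made-up numbers may help. Suppose x1 is chosen at node A with an impurity reduction of 10, while at node B the winner is x2 (reduction 8) and x1 is a close runner-up that would have achieved 7:

```r
# Hypothetical impurity reductions at two nodes
imp_x1 <- 10 + 7  # credited where it splits (node A) plus where it is a
                  # close competitor/surrogate (node B)
imp_x2 <- 8       # credited only where it actually splits

imp_x1 > imp_x2   # TRUE: x1 outranks x2 despite splitting only once
```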

I hope this clarifies your doubts. For more details, see here. Just a quick quotation:

The following methods for estimating the contribution of each variable to the model are available [speaking of how variable importance is computed]:

[...]

- Recursive Partitioning: The reduction in the loss function (e.g. mean squared error) attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. This can be turned off using the maxcompete argument in rpart.control.

I am not very familiar with caret, but from this quotation it appears that the package actually uses rpart() to grow trees, and thus inherits this handling of surrogate variables.
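You can see the effect in rpart itself. A sketch using the kyphosis data shipped with rpart (note that rpart's own variable.importance credits surrogates, while caret's varImp() additionally tabulates close competitors):

```r
library(rpart)

# Default fit: importance sums primary-split improvements
# plus credit from surrogate splits
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit$variable.importance   # can mention variables the plotted tree never uses

# With surrogates and competitors turned off, only variables
# actually chosen for splits receive credit
fit0 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              control = rpart.control(maxcompete = 0, maxsurrogate = 0))
fit0$variable.importance
```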
