
caret rpart decision tree plotting result

2020-01-09 04:55:20 · tags: r, decision-tree, r-caret, rpart

I am training a decision tree model based on the heart disease data from Kaggle.

Since I am also building other models with 10-fold CV, I am trying to use the caret package with the rpart method to build the tree. However, the plot looks weird, since "thalium" should be a factor. Why does it show "thaliumnormal < 0.5"? Does this mean that if "thalium" == "normal", take the left route ("yes"), otherwise the right route ("no")?

Many thanks!

Edits: I apologize for not providing enough background info, which seemed to cause some confusion. "thalium" is a variable that represents a technique used to detect coronary stenosis (aka narrowing). It's a factor with three levels (normal, fixed defect, reversible defect).

In addition, I would like to make the graph more readable, e.g. instead of "thaliumnormal < 0.5" it should say something like "thalium = normal". I can achieve this by using rpart directly (see below).

However, you have probably noticed that the tree is different, even though I used the cp value recommended by caret rpart with 10-fold CV (see the code below).

I understand that the two packages may give somewhat different results. Ideally, I would use caret with method rpart to build the tree so that it aligns with the other models built in caret. Does anyone know how to make the plot labels for the tree model built with caret rpart easier to understand?

2 answers

It would help if there were some data, like dput(head(data)) to show us what your data really looks like, or str(data) to show the variable types and factor levels.

But most likely (without having seen it) the variable is thallium, one of its levels is normal, and the tree has picked that LEVEL of thallium and is evaluating whether an observation is at the level normal or not.

The tree treats a categorical variable as one dummy per level and makes a decision based on the dummy being >= .5 or < .5; 0 is always below the threshold and 1 is always above it.
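To make the dummy coding concrete, here is a minimal base R sketch. It assumes the factor is called thalium with the three levels named in the question (the four rows are made up for illustration) and uses model.matrix(), which builds the same kind of 0/1 columns that a formula interface produces. Note that with the default treatment contrasts the first level becomes the baseline and gets no column of its own:

```r
# Hypothetical toy data: only the factor name and its levels come from the question.
df <- data.frame(
  thalium = factor(c("normal", "fixed defect", "reversible defect", "normal"))
)

# Expand the factor into 0/1 dummy columns (default treatment contrasts).
mm <- model.matrix(~ thalium, data = df)

# Column names are the variable name pasted onto the level name,
# which is where a label like "thaliumnormal" comes from.
colnames(mm)
# "(Intercept)" "thaliumnormal" "thaliumreversible defect"

# The thaliumnormal column is 1 where thalium == "normal" and 0 otherwise.
mm[, "thaliumnormal"]
```

The baseline level ("fixed defect", first alphabetically) has no column; it is the case where every dummy is 0.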

By design, most tree algorithms choose, for each variable (including a 0/1 dummy), the cut-off that creates the most purity (moves the most observations to one side or the other and closer to a classification), and place it midway between the two values that create the greatest separation between groups.

With a binary variable, that split is at .5 because it is midway between the only two values the dummy can take, 0 and 1.
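The midpoint rule can be sketched in a few lines of base R. candidate_splits() is a hypothetical helper written for this illustration, not a function from rpart or caret; it only lists the midpoints between adjacent distinct values, which is where a tree would consider cutting a numeric variable:

```r
# Hypothetical helper: midpoints between adjacent distinct values of x.
candidate_splits <- function(x) {
  u <- sort(unique(x))
  (head(u, -1) + tail(u, -1)) / 2
}

# A 0/1 dummy has exactly one possible cut-off.
candidate_splits(c(0, 1, 1, 0))
# 0.5

# A numeric variable with more distinct values has more candidates.
candidate_splits(c(1, 3, 7))
# 2 5
```

So for a dummy such as thaliumnormal there is nothing to tune: 0.5 is the only threshold that separates the two groups.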

Your variable thaliumnormal is either 0 or 1 (no or yes), correct?

In that case, rpart takes the midpoint 0.5, so that every value of 0 or 1 falls either below or above 0.5.

Values below the cut-off, in this case 0.5, always go left. So thaliumnormal == 0 goes left ("yes").

You can see the same thing for sex.
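As a rough illustration of the routing rule described in this answer, the base R sketch below maps each factor level to the branch it would take at a node labelled "thaliumnormal < 0.5". The data are made up, and the "left (yes)" / "right (no)" strings are just labels chosen for this example:

```r
# Hypothetical toy data using the three levels from the question.
thalium <- factor(c("normal", "fixed defect", "reversible defect", "normal"))

# The dummy is 1 for "normal" and 0 for every other level.
thaliumnormal <- as.integer(thalium == "normal")

# Node condition "thaliumnormal < 0.5": TRUE goes left, FALSE goes right.
branch <- ifelse(thaliumnormal < 0.5, "left (yes)", "right (no)")

data.frame(thalium, thaliumnormal, branch)
```

Under this reading, observations with thalium == "normal" go right ("no"), and the two defect levels go left ("yes"), which is the opposite of the interpretation proposed in the question.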