
caret rpart decision tree plotting result

I am training a decision tree model based on the heart disease data from Kaggle.

Since I am also building other models using 10-fold CV, I am trying to use the caret package with the rpart method to build the tree. However, the plot result looks odd, because "thalium" should be a factor. Why does it show "thaliumnormal < 0.5"? Does this mean that if "thalium" == "normal", the tree takes the left route ("yes"), and otherwise the right route ("no")?

Many thanks!

[Figure: caret rpart decision tree plotted with fancyRpartPlot]

Edit: I apologize for not providing enough background information, which seems to have caused some confusion. "thalium" is a variable that represents a technique used to detect coronary stenosis (i.e. narrowing). It is a factor with three levels (normal, fixed defect, reversible defect).

[Figure: data structure]

In addition, I would like to make the graph more readable, e.g. instead of "thaliumnormal < 0.5" it should say something like "thalium = normal". I could achieve this by using rpart directly (see below).

[Figure: rpart decision tree plot]

However, you have probably noticed that the tree is different, even though I used the cp value recommended by caret rpart with 10-fold CV (see the code below).

[Figure: code]

[Figure: recommended cp, used for the rpart tree plotted with fancyRpartPlot]

I understand that the two packages may produce somewhat different results. Ideally, I would use caret with method rpart to build the tree so that it is consistent with the other models built in caret. Does anyone know how I can make the plot labels for a tree built with caret rpart easier to understand?

It would help to have some data, such as dput(head(data)) to show what your data really looks like, or str(data) to show the data types and factor levels.

But most likely (without having seen the data) the variable is thallium and one of its levels is normal; the tree has selected that LEVEL of the variable thallium and is evaluating whether an observation is at level normal or not.

The tree treats categorical variables as dummy variables, one per level, and makes a decision based on the value being >= 0.5 or < 0.5; 0 is always below the cut-off and 1 is always above it.
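To make the dummy encoding concrete, here is a minimal sketch in base R. The toy data frame and its level names (fixed_defect, reversible_defect) are made up for illustration and are not the asker's data. A formula-based model matrix expands a factor in roughly this way, which is where a column name like thaliumnormal comes from.

```r
# Toy data: a hypothetical three-level factor standing in for `thalium`.
heart <- data.frame(
  thalium = factor(c("normal", "fixed_defect", "reversible_defect", "normal"))
)

# model.matrix() expands the factor into 0/1 indicator columns, one per
# non-reference level. Each column name is the variable name pasted
# directly onto the level name, hence "thaliumnormal".
dummies <- model.matrix(~ thalium, data = heart)

print(colnames(dummies))
print(dummies[, "thaliumnormal"])
```

The first level in alphabetical order (here fixed_defect) becomes the reference level and gets no column of its own.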

By design, most tree algorithms choose, for each variable (including a 0/1 dummy), the cut-off that creates the most purity (it moves the most observations to one side or the other, closer to a classification), and they place that cut-off midway between the two values that create the greatest separation between the groups.
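As a rough illustration of what "most purity" means, the sketch below scores one candidate cut-off with the Gini index, a common impurity measure for classification trees. The helper names gini and weighted_gini and the six toy observations are assumptions made for this example, not rpart internals.

```r
# Gini impurity of a set of class labels: 0 means the node is pure.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Size-weighted impurity of the two child nodes produced by `x < cut`.
weighted_gini <- function(x, y, cut) {
  left <- x < cut
  (sum(left) * gini(y[left]) + sum(!left) * gini(y[!left])) / length(y)
}

# Toy 0/1 dummy and class labels (made up for illustration).
x <- c(0, 0, 0, 1, 1, 1)
y <- c("yes", "yes", "no", "no", "no", "no")

before <- gini(y)                  # impurity of the unsplit node
after  <- weighted_gini(x, y, 0.5) # impurity after splitting at 0.5
print(c(before = before, after = after))
```

A split is attractive when the impurity after it is clearly lower than before; here it drops from 4/9 to 2/9.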

With a binary variable, that split is at 0.5, because it is midway between the only two values a level dummy can take, 0 and 1.
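The midpoint rule can be sketched in a few lines of base R. The function name split_candidates is hypothetical; it simply lists the points midway between consecutive distinct values, which is how a 0/1 column ends up with the single cut-off 0.5.

```r
# Candidate cut-offs: points midway between consecutive distinct values.
split_candidates <- function(x) {
  v <- sort(unique(x))
  (head(v, -1) + tail(v, -1)) / 2
}

# A 0/1 dummy has exactly one candidate.
binary_cut <- split_candidates(c(1, 0, 0, 1, 1, 0))

# A numeric variable with three distinct values has two candidates.
numeric_cuts <- split_candidates(c(40, 52, 61, 52))

print(binary_cut)
print(numeric_cuts)
```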

Your variable thaliumnormal is either 0 or 1, representing no (0) or yes (1), correct?

In that case, rpart takes the midpoint 0.5, so every value of 0 or 1 falls either below or above 0.5.

Values below the cut-off, in this case 0.5, always turn left. So thaliumnormal == 0 turns left ("yes").
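A small sketch of that routing rule, using made-up observations: the node condition thaliumnormal < 0.5 is true for rows where the level is not normal, so those go left, while rows with thalium equal to normal go right.

```r
# Hypothetical observations; level names other than "normal" are made up.
thalium <- factor(c("normal", "fixed_defect", "normal", "reversible_defect"))

# The dummy is 1 when the level is "normal", 0 otherwise.
thaliumnormal <- as.numeric(thalium == "normal")

# The node asks "thaliumnormal < 0.5?": yes goes left, no goes right.
branch <- ifelse(thaliumnormal < 0.5, "left (yes)", "right (no)")

print(data.frame(thalium, thaliumnormal, branch))
```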

You can see the same thing for sex.
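To see why the labels differ between the two plots, the sketch below fits rpart twice on made-up data: once on the factor itself and once on a pre-built 0/1 dummy. The data, level names, and class labels are assumptions for illustration only. When the model is given the factor, the split variable is thalium, so a plot can label it by level; when it is given the dummy, the split variable is thaliumnormal and the only possible label is a numeric cut-off.

```r
library(rpart)

# Made-up data: 20 rows per level, class fully determined by the level.
heart <- data.frame(
  thalium = factor(rep(c("normal", "fixed_defect", "reversible_defect"), each = 20))
)
heart$target <- factor(ifelse(heart$thalium == "normal", "no_disease", "disease"))

# Fit on the factor directly: the tree splits on groups of levels.
fit_factor <- rpart(target ~ thalium, data = heart, method = "class")

# Fit on a 0/1 dummy, mimicking what a dummy-expanded model matrix provides.
heart_dummy <- data.frame(
  thaliumnormal = as.numeric(heart$thalium == "normal"),
  target = heart$target
)
fit_dummy <- rpart(target ~ thaliumnormal, data = heart_dummy, method = "class")

# The first row of `frame` describes the root node and its split variable.
split_var_factor <- as.character(fit_factor$frame$var[1])
split_var_dummy  <- as.character(fit_dummy$frame$var[1])
print(c(factor_fit = split_var_factor, dummy_fit = split_var_dummy))
```

For caret specifically, train() also accepts x and y arguments instead of a formula. As far as I know that route passes factor columns to rpart without expanding them into dummies, which should give level-based labels, but it is worth verifying against the caret documentation for your version.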
