Plot the Decision Tree Classifier
In my dataset I have a binary Target variable (0 or 1) and 8 features: nchar, rtc, Tmean, week_day, hour, ntags, nlinks and nex. week_day is a factor while the others are numeric. I'm trying to build a decision tree classifier:
library(caTools)
set.seed(123)
split = sample.split(dataset$Target, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-c(2,4)] = scale(training_set[-c(2,4)])
test_set[-c(2,4)] = scale(test_set[-c(2,4)])
# Fitting Decision Tree Classification to the Training set
# install.packages('rpart')
library(rpart)
classifier = rpart(formula = Target ~ .,
                   data = training_set)
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-2], type = 'class')
# Making the Confusion Matrix
cm = table(test_set[, 2], y_pred)
plot(classifier, uniform = TRUE, margin = 0.2)
The result of the plot is the following:
I have three questions I don't know the answers to:
1. Why are some variables left out (for example rtc)?
2. What does aefg in week_day mean?
3. Is there a way to describe the different classes (0 vs 1 for the Target variable)? For example: in Target=1 we have all the rows that have nchar>0.19 and ntags>1.9, etc.
Here is an explanation using some data that ships with the rpart package:
library(rpart) # for decision tree
library(rattle) # to do a nicer plot
progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec,
              method = 'class')
Question 1: why are some variables left out?
Because those variables are not useful for your model or, put better, you have told the model not to keep splits that fall below the cp parameter (default = 0.01).
Looking at the doc for the cp parameter:
(...) Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it. (...)
I think the doc explains it better technically than I can; in simple words, the cp parameter sets the baseline of "utility" for a node.
If a node is built on a useless variable, it is cut out, so useless variables (read: variables that add no further information to the model) don't appear. Try setting the parameter in your model and you'll see how the tree changes. In my case, the eet variable is out.
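To see this effect directly, here is a minimal sketch that refits the same stagec tree with three cp values and lists which variables end up in the splits. The helpers fit_with_cp and split_vars are hypothetical names introduced only for this illustration; the cp values 0.05 and 0.001 are arbitrary picks around the default.

```r
library(rpart)

# Same example data as in this answer: stagec ships with rpart.
progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
tree_formula <- progstat ~ age + eet + g2 + grade + gleason + ploidy

# Hypothetical helper: fit the tree with a given complexity parameter.
fit_with_cp <- function(cp) {
  rpart(tree_formula, data = stagec, method = "class",
        control = rpart.control(cp = cp))
}

# Hypothetical helper: names of the variables used in at least one split.
# Leaf rows in fit$frame carry the marker "<leaf>" instead of a variable.
split_vars <- function(fit) {
  setdiff(unique(as.character(fit$frame$var)), "<leaf>")
}

fit_strict  <- fit_with_cp(0.05)   # stricter than the default
fit_default <- fit_with_cp(0.01)   # rpart's default
fit_loose   <- fit_with_cp(0.001)  # more permissive than the default

for (fit in list(fit_strict, fit_default, fit_loose)) {
  cat("cp =", fit$control$cp,
      "| nodes:", nrow(fit$frame),
      "| variables:", paste(split_vars(fit), collapse = ", "), "\n")
}
```

A variable that is missing at the default cp may show up once cp is lowered, at the price of a larger tree that is more likely to overfit.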
Running this:
summary(cfit)
Call:
rpart(formula = progstat ~ age + eet + g2 + grade + gleason +
    ploidy, data = stagec, method = "class")
  n= 146

          CP nsplit rel error    xerror      xstd
1 0.10493827      0 1.0000000 1.0000000 0.1080241
2 0.05555556      3 0.6851852 1.0555556 0.1091597
3 0.02777778      4 0.6296296 0.9629630 0.1071508
4 0.01851852      6 0.5740741 0.9629630 0.1071508
5 0.01000000      7 0.5555556 0.9814815 0.1075992

Variable importance
     g2   grade gleason  ploidy     age     eet
     30      28      20      13       7       2
(... it continues...)
You can see that eet is the least important.
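The same numbers can be pulled out of the fitted object instead of being read off the printout. The sketch below assumes that the percentages printed by summary() are the raw scores in cfit$variable.importance rescaled to sum to 100, which matches the output above. rpart's importance also credits surrogate splits, which is why a variable can have a non-zero score without appearing in the tree.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

# Raw importance scores, rescaled here so that they sum to 100.
importance <- cfit$variable.importance
importance_pct <- round(100 * importance / sum(importance))

# Variables that appear as the primary split of at least one node.
used <- setdiff(unique(as.character(cfit$frame$var)), "<leaf>")

# One row per variable: its share of importance and whether the tree uses it.
print(data.frame(importance_pct,
                 in_tree = names(importance_pct) %in% used))
```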
Question 2: what does aefg in week_day mean?
It means that the split sends some of the week_day levels to the left and the others to the right. It should be a categorical variable.
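The letters are rpart's default abbreviation for factor levels: a is the first level, b the second, and so on. A small sketch on the stagec tree, using the ploidy factor as a stand-in for week_day, shows the abbreviated and the full labels side by side. The names short_labels, full_labels and level_key are made up for this example.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

# Split labels with factor levels abbreviated to single letters.
short_labels <- labels(cfit, minlength = 1)
# The same labels with the factor levels spelled out in full.
full_labels <- labels(cfit, minlength = 0)
print(data.frame(short_labels, full_labels))

# Key for decoding the letters: "a" is the first level, "b" the second, ...
ploidy_levels <- levels(stagec$ploidy)
level_key <- setNames(ploidy_levels, letters[seq_along(ploidy_levels)])
print(level_key)
```

By the same logic, levels(dataset$week_day)[c(1, 5, 6, 7)] should give the days behind aefg in your tree, and calling text(classifier, pretty = 0) after plot() draws the full level names on the plot.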
Try using this instead of the classic plot:
fancyRpartPlot(cfit, caption = NULL)
You can see that diploid and tetraploid are sent to the left, and the remaining level to the right. From here:
(...) The tree is arranged so that the "more severe" nodes go to the right (...)
Question 3: is there a way to describe the different classes (0 vs 1 for the Target variable)?
In this case the variable is progstat, but you can carry the explanation over to your own variable.
This is how I generally read those results in the plot:
Looking at the first node (the most important): it tells us that 63% are "No" and 37% are "Prog" (read: yes). That node covers 100% of the population.
The second most important node is node 2, and the condition that leads to it is grade < 2.5. Otherwise, you go to node 3.
If you go to the left, you have 42% of the population. The label of that population is No, but only 85% of it is truly No; the rest are mislabelled as No.
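The percentages shown in each box of the plot can also be read from the fitted object. The sketch below relies on the layout of cfit$frame$yval2 for classification trees (predicted class, class counts, class probabilities, node share); node_table and the other variable names are introduced only for this illustration.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

frame <- cfit$frame
classes <- attr(cfit, "ylevels")   # class labels, here "No" and "Prog"
k <- length(classes)

# yval2 columns: predicted class, k class counts, k class probabilities,
# and the share of the data that reaches the node.
class_prob <- frame$yval2[, 1 + k + seq_len(k), drop = FALSE]
colnames(class_prob) <- classes
node_share <- frame$yval2[, ncol(frame$yval2)]

node_table <- data.frame(
  node      = as.integer(rownames(frame)),
  split_var = as.character(frame$var),
  predicted = classes[frame$yval],
  share_pct = round(100 * node_share),
  round(100 * class_prob)
)
print(node_table, row.names = FALSE)
```

Each row corresponds to one box of the plot: the node number, the predicted label, the share of the population in the node, and the class split inside it.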
TL;DR
This means that "the overall population is divided into No and Prog, at 63% and 37%. If the variable grade is < 2.5, the model says that in our data 42% of the population has that value of grade, and within that 42%, 85% have the result No. So grade and the outcome "No" of the dependent variable are probably linked".
In this way you can check all the nodes in your plot, and also use summary(), to see the most important patterns.
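Walking through the nodes by hand can be automated. As a rough sketch, path.rpart() returns the chain of split conditions leading to any node, so looping over the leaves gives one readable rule per class region. The variable names below are illustrative, and the output format is just one possible choice.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

# Terminal nodes: their ids, predicted class and share of the rows.
is_leaf    <- cfit$frame$var == "<leaf>"
leaf_ids   <- as.numeric(rownames(cfit$frame)[is_leaf])
leaf_class <- attr(cfit, "ylevels")[cfit$frame$yval[is_leaf]]
leaf_share <- cfit$frame$n[is_leaf] / cfit$frame$n[1]

# For each requested node, the conditions from the root down to it.
paths <- path.rpart(cfit, nodes = leaf_ids, print.it = FALSE)

for (i in seq_along(leaf_ids)) {
  conditions <- paths[[as.character(leaf_ids[i])]][-1]  # drop "root"
  cat(sprintf("%-4s (%4.1f%% of rows) if %s\n",
              leaf_class[i], 100 * leaf_share[i],
              paste(conditions, collapse = " & ")))
}
```

Grouping the printed rules by predicted class gives a description of each class in the form asked for in the question ("in Target=1 we have all the rows that have ...").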
In your plot, you can say that "if ntags > 1.952 and nchar < 0.1449, then I have a 0".
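One caveat when reading those thresholds: the features in the question were passed through scale() before fitting, so 1.952 and 0.1449 are in standardized units, not in the original ones. A minimal sketch of converting a threshold back, assuming you still have the unscaled training column; the raw vector below is made-up data standing in for ntags.

```r
# Hypothetical raw values standing in for one numeric training feature.
raw <- c(0, 1, 1, 2, 3, 5, 8, 2, 0, 4)

# scale() stores the mean and standard deviation it used as attributes.
scaled <- scale(raw)
center <- attr(scaled, "scaled:center")
spread <- attr(scaled, "scaled:scale")

# Threshold read off a tree fitted on scaled data (value taken from the plot).
threshold_scaled <- 1.952
threshold_raw <- threshold_scaled * spread + center

cat("scaled threshold", threshold_scaled,
    "corresponds to raw value", threshold_raw, "\n")
```

Since tree splits depend only on the ordering of the values, fitting rpart on the unscaled features is another option and yields thresholds that can be read directly.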