Plot the Decision Tree Classifier

In my dataset I have a binary Target (0 or 1) variable and 8 features: nchar, rtc, Tmean, week_day, hour, ntags, nlinks and nex. week_day is a factor while the others are numeric. I'm trying to build a decision tree classifier:

library(caTools)
set.seed(123)
split = sample.split(dataset$Target, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling
training_set[-c(2,4)] = scale(training_set[-c(2,4)])
test_set[-c(2,4)] = scale(test_set[-c(2,4)])

# Fitting Decision Tree Classification to the Training set
# install.packages('rpart')
library(rpart)
classifier = rpart(formula = Target ~ .,
                   data = training_set)

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-2], type = 'class')

# Making the Confusion Matrix
cm = table(test_set[, 2], y_pred)

plot(classifier, uniform=TRUE,margin=0.2)

The result of the plot is the following:

[image: plot of the fitted decision tree]

I have three questions I don't know the answers to:

  1. Why are some variables missing from the plot (e.g. rtc)?
  2. What does aefg in week_day mean?
  3. Is there a way to describe the different classes (0 vs 1 for the Target variable)? For example: in Target=1 we have all the rows that have nchar>0.19 and ntags>1.9, etc.

Here is an explanation using some data that ships with the rpart package:

library(rpart)   # for decision tree
library(rattle)  # to do a nicer plot

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit     <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                  data = stagec,
                  method = "class")

Question 1: why are some variables missing?
Because those variables are not useful for your model or, more precisely, you have told the model not to keep splits that improve the fit by less than the cp parameter (default = 0.01).
From the documentation for the cp parameter:

(...) Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it. (...)

I think the documentation explains it better technically than I can but, in simple words, the cp parameter sets the baseline of "utility" for a split.
If a split adds too little, it is cut out, so variables that bring no further information to the model don't appear in the plot. Try setting the parameter in your model and you'll see how the tree changes. In my case, the eet variable is out.
Running this:

 summary(cfit)
Call:
rpart(formula = progstat ~ age + eet + g2 + grade + gleason + 
    ploidy, data = stagec, method = "class")
  n= 146 

          CP nsplit rel error    xerror      xstd
1 0.10493827      0 1.0000000 1.0000000 0.1080241
2 0.05555556      3 0.6851852 1.0555556 0.1091597
3 0.02777778      4 0.6296296 0.9629630 0.1071508
4 0.01851852      6 0.5740741 0.9629630 0.1071508
5 0.01000000      7 0.5555556 0.9814815 0.1075992

Variable importance
     g2   grade gleason  ploidy     age     eet 
     30      28      20      13       7       2 

(... it continues...)

You can see that eet is the least important.
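To see the effect of cp directly, here is a minimal sketch that fits the same stagec tree with two cp values and lists which variables end up in a split. The helper vars_used and the two cp values (0.05 and 0.001) are my own choices for illustration, not part of rpart or of the question.

```r
library(rpart)

set.seed(123)  # cross-validation inside rpart is random
progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))

# Hypothetical helper: names of the variables that appear in at least one split
vars_used <- function(fit) {
  setdiff(unique(as.character(fit$frame$var)), "<leaf>")
}

# Stricter threshold: a split must improve the fit by at least 0.05
fit_strict <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                    data = stagec, method = "class",
                    control = rpart.control(cp = 0.05))

# Looser threshold: much smaller improvements are accepted
fit_loose <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                   data = stagec, method = "class",
                   control = rpart.control(cp = 0.001))

print(vars_used(fit_strict))
print(vars_used(fit_loose))
printcp(fit_loose)  # the CP table, same layout as in summary()
```

A lower cp generally gives a larger tree, so more variables have a chance to show up in the plot, at the cost of a higher risk of overfitting.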

Question 2: what does aefg in week_day mean?
It means the split sends some levels of week_day to the left and the others to the right, which is how a categorical variable (a factor) is split. By default the plot abbreviates factor levels to single letters (a for the first level, b for the second, and so on), so aefg stands for the 1st, 5th, 6th and 7th levels of week_day.
Try this instead of the classical plot:

fancyRpartPlot(cfit, caption = NULL)

[image: fancyRpartPlot output for cfit]

You can see that diploid and tetraploid are sent to the left, and the other level to the right. From the linked reference:

(...) The tree is arranged so that the "more severe" nodes go to the right (...)
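If you prefer to stay with the base plot, the letters can be decoded or avoided. The sketch below, again on the stagec example, prints the factor levels in order (so a, b, c can be mapped back to names) and then asks for unabbreviated labels; for your data you would look at levels(dataset$week_day) instead.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

# Level order of the factor: "a" is the first level, "b" the second, ...
print(levels(stagec$ploidy))

# Split labels with single-letter abbreviations (the default in text())
short_labels <- labels(cfit, minlength = 1)
# Split labels with the full level names
full_labels <- labels(cfit, minlength = 0)
print(short_labels)
print(full_labels)

# Base plot with full level names instead of letters
plot(cfit, uniform = TRUE, margin = 0.2)
text(cfit, pretty = 0)
```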

Question 3: is there a way to describe the different classes (0 vs 1 for the Target variable)?
In this case the variable is progstat, but you can carry the explanation over to your variable.
This is how I generally read those results in the plot:

Looking at the first node (the most important): it tells us that 63% are "No" and 37% are "Prog" (read: yes). That node covers 100% of the population.

The second most important node is node 2, which you reach when grade < 2.5. Otherwise, you go to node 3.

If you go left, you have 42% of the population. That group is labelled No: 85% of it is truly No, and the rest are Prog cases that the node mislabels as No.
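The same numbers can be read from the fitted object rather than from the picture. This is a small sketch on the stagec example; the nodes data frame and its column names are my own, built from the frame component that rpart stores on the fit.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

# Class shares in the whole data set: this is what the root node displays
root_share <- prop.table(table(progstat))
print(round(100 * root_share))

# One row per node: node number, splitting variable, rows covered,
# share of all rows, and the class the node predicts
nodes <- data.frame(
  node        = as.integer(rownames(cfit$frame)),
  var         = as.character(cfit$frame$var),
  n           = cfit$frame$n,
  pct_of_rows = round(100 * cfit$frame$n / cfit$frame$n[1]),
  predicted   = attr(cfit, "ylevels")[cfit$frame$yval]
)
print(nodes)
```

Rows where var is "<leaf>" are terminal nodes; the others are the splits you see in the plot.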

TL;DR
This means: "The overall population is divided into No and Prog, at 63% and 37%.
If grade is < 2.5, which holds for 42% of the population in our data, 85% of those rows have the result No. So grade and the outcome No are probably related".
In this way you can check all the nodes in your plot, and also use summary(), to see the most important patterns.

In your plot, you can say that "if ntags > 1.952 and nchar < 0.1449, then I have a 0".
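Rules like that one can also be listed programmatically. The sketch below uses path.rpart from the rpart package on the stagec example to print the chain of conditions leading to every leaf that predicts "Prog"; the variable names (prog_leaves, rules) are my own, and for your model you would use classifier and your own class label instead.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

frame     <- cfit$frame
is_leaf   <- frame$var == "<leaf>"
# yval is an index into the class levels stored on the fit
predicted <- attr(cfit, "ylevels")[frame$yval]

# Node numbers of the terminal nodes that predict "Prog"
prog_leaves <- as.numeric(rownames(frame)[is_leaf & predicted == "Prog"])

# For each node, the conditions on the path from the root
rules <- path.rpart(cfit, nodes = prog_leaves, print.it = FALSE)

for (node in names(rules)) {
  conditions <- rules[[node]][-1]  # drop the leading "root" entry
  cat("Node", node, ":", paste(conditions, collapse = " & "), "\n")
}
```

Each printed line is one "if ... and ... then Prog" description; swapping "Prog" for "No" gives the rules for the other class.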
