Plot a decision tree classifier

Question

In my dataset I have a binary Target variable (0 or 1) and 8 features: nchar, rtc, Tmean, week_day, hour, ntags, nlinks and nex. week_day is a factor while the others are numeric. I'm trying to build a decision tree classifier:

library(caTools)
set.seed(123)
split = sample.split(dataset$Target, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling
training_set[-c(2,4)] = scale(training_set[-c(2,4)])
test_set[-c(2,4)] = scale(test_set[-c(2,4)])

# Fitting Decision Tree Classification to the Training set
# install.packages('rpart')
library(rpart)
classifier = rpart(formula = Target ~ .,
                   data = training_set)

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-2], type = 'class')

# Making the Confusion Matrix
cm = table(test_set[, 2], y_pred)

plot(classifier, uniform=TRUE,margin=0.2)

The result of the plot is the following:

I have three questions I don't know the answers to:

Why are some variables missing from the plot (e.g. rtc)?
What does aefg in week_day mean?
Is there a way to describe the different classes (0 vs 1 for the Target variable)? For example: in Target=1 we have all the rows that have nchar>0.19 and ntags>1.9, etc.

Answer 1

Here is an explanation using some data that ships with the rpart package:

library(rpart)   # for decision tree
library(rattle)  # to do a nicer plot

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit     <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                  data = stagec,
                  method = 'class')

Question 1: why are some variables left out?
Because those variables are not useful for your model or, better said, you've told the model not to keep splits that improve the fit by less than the cp parameter (default = 0.01).
Looking at the doc for the cp parameter:

(...) Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it.

I think the doc explains it better technically than I can but, put simply, the cp parameter sets the baseline of "utility" for a split.
If a split on a variable does not improve the fit by at least cp, it is cut out, so variables that add no further information to the model never appear in the tree. Try changing cp in your model and you'll see how the tree changes. In my case, the eet variable is left out.
Running this:

 summary(cfit)
Call:
rpart(formula = progstat ~ age + eet + g2 + grade + gleason + 
    ploidy, data = stagec, method = "class")
  n= 146 

          CP nsplit rel error    xerror      xstd
1 0.10493827      0 1.0000000 1.0000000 0.1080241
2 0.05555556      3 0.6851852 1.0555556 0.1091597
3 0.02777778      4 0.6296296 0.9629630 0.1071508
4 0.01851852      6 0.5740741 0.9629630 0.1071508
5 0.01000000      7 0.5555556 0.9814815 0.1075992

Variable importance
     g2   grade gleason  ploidy     age     eet 
     30      28      20      13       7       2 

(... it continues...)

You can see that eet is the least important.
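
To see the effect of cp directly, one option is to refit the same tree with a smaller threshold and compare which variables end up in a split. This is only a sketch on the stagec data: the helper used_vars() and the value cp = 0.001 are my own choices for illustration, not something taken from the question.

```r
library(rpart)

# Same model as above, refitted so this snippet runs on its own
progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
form     <- progstat ~ age + eet + g2 + grade + gleason + ploidy

fit_default <- rpart(form, data = stagec, method = "class")   # cp = 0.01
fit_loose   <- rpart(form, data = stagec, method = "class",
                     control = rpart.control(cp = 0.001))     # assumed, looser threshold

# Hypothetical helper: variables that appear in at least one split.
# fit$frame$var holds the split variable of every node, "<leaf>" for leaves.
used_vars <- function(fit) {
  setdiff(unique(as.character(fit$frame$var)), "<leaf>")
}

used_vars(fit_default)
used_vars(fit_loose)

# The cp table also lists the variables actually used in tree construction
printcp(fit_default)
```

A variable that is missing from used_vars() can still show up under "Variable importance" in summary(), because that score also counts surrogate splits.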

Question 2: what does aefg in week_day mean?
It means that the split sends some of the week_day levels to the left and the others to the right. week_day should be a categorical variable (a factor). By default the split labels abbreviate factor levels as letters (a = first level, b = second, and so on), so aefg stands for the 1st, 5th, 6th and 7th levels of week_day.
Try using this instead of the classic plot:

fancyRpartPlot(cfit, caption = NULL)

You can see that diploid and tetraploid are sent to the left, and the other level to the right. From here:

(...) The tree is arranged so that the "more severe" nodes go to the right
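
If you prefer to stay with the base plot, a small sketch like the following may help to decode the letters. It builds a lookup table between letters and factor levels for the ploidy variable of stagec, and redraws the tree with pretty = 0 so that text() prints the full level names. For your data you would use levels(dataset$week_day) instead; that column name is taken from the question and not checked here.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit     <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                  data = stagec, method = "class")

# Letters used in the split labels follow the order of the factor levels
lv  <- levels(stagec$ploidy)
key <- data.frame(letter = letters[seq_along(lv)], level = lv)
key

# Base plot with full level names instead of letters
plot(cfit, uniform = TRUE, margin = 0.2)
text(cfit, pretty = 0, use.n = TRUE)
```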

Question 3: is there a way to describe the different classes (0 vs 1 for the Target variable)?
In this case the variable is progstat, but you can carry the explanation over to your variable.
This is how I generally read those results in the plot:

Looking at the first node (the most important): it tells us that 63% are "No" and 37% are "Prog" (read: yes). That node covers 100% of the population.

The second most important node is node 2, and the split that leads to it is grade < 2.5. Otherwise, you go to node 3.

If you go to the left, you have 42% of the population. The label of that population is No, but only 85% of it is really No; the rest are mislabelled as No.
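
The percentages shown in the nodes can be checked by hand on the data. The sketch below recomputes the class shares at the root and in the grade < 2.5 branch of stagec; the variable names in_left, root_share, left_fraction and left_share are mine.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))

# Root node: share of each class in the whole data set
root_share <- prop.table(table(progstat))
round(100 * root_share)

# Node 2: observations with grade < 2.5
in_left       <- !is.na(stagec$grade) & stagec$grade < 2.5
left_fraction <- mean(in_left)                        # share of the population in node 2
left_share    <- prop.table(table(progstat[in_left])) # class shares inside node 2

round(100 * left_fraction)
round(100 * left_share)
```

The three rounded results should correspond to the figures read from the plot (root share, size of the left branch, share of No inside it).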

TL;DR
This means that "the overall population is divided into No and Prog, at 63% and 37%.
If the variable grade is < 2.5, the model says that, in our data, 42% of the population has that value of grade and, within that 42%, 85% have the result No. Probably grade and the result "No" of the dependent variable are linked".
In this way you can check all the nodes in your plot, also using summary(), to see the most important patterns.

In your plot, you can say that "if ntags > 1.952 and nchar < 0.1449, then I have a 0".
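
Rules of this kind can also be listed for every leaf instead of being read off the picture. A possible sketch, again on stagec, uses path.rpart() from the rpart package to get the chain of conditions leading to each terminal node, together with the class predicted there. The object names (is_leaf, leaves, rules, pred_class) are my own.

```r
library(rpart)

progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit     <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                  data = stagec, method = "class")

# Terminal nodes: rows of the frame whose split variable is "<leaf>"
is_leaf <- cfit$frame$var == "<leaf>"
leaves  <- as.numeric(rownames(cfit$frame)[is_leaf])

# Conditions from the root down to each leaf (first element is always "root")
rules <- path.rpart(cfit, nodes = leaves, print.it = FALSE)

# Predicted class in each leaf: yval indexes the response levels
pred_class <- attr(cfit, "ylevels")[cfit$frame$yval[is_leaf]]

for (i in seq_along(rules)) {
  cat("node", names(rules)[i], "->", pred_class[i], "if",
      paste(rules[[i]][-1], collapse = " & "), "\n")
}
```

With your own model you would replace cfit by classifier; each printed line then describes one group of rows and the Target value the tree assigns to it.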
