
How do I generate a Decision Tree plot and a Variable Importance plot in Random Forest using R?

I am new to Data Science and I am working on a Machine Learning analysis using the Random Forest algorithm to perform a classification. The target variable in my data set is called Attrition (Yes/No).

I am a bit confused as to how to generate these 2 plots in Random Forest:

(1) Feature Importance Plot

(2) Decision Tree Plot

I understand that a Random Forest is an ensemble of several Decision Tree models built from the data set.
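As a quick illustration of that ensemble idea, here is a sketch (it uses the built-in mtcars data as a stand-in, since TrainDf is not shown, and assumes the randomForest package is installed): predict() with predict.all = TRUE returns each individual tree's vote alongside the forest's aggregated prediction.

```r
library(randomForest)

set.seed(42)
# a small stand-in for TrainDf: classify mtcars engines (vs = 0/1)
rf <- randomForest(factor(vs) ~ ., data = mtcars, ntree = 100)

# predict.all = TRUE returns the forest's aggregate vote plus
# one column of predictions per individual tree
p <- predict(rf, newdata = mtcars, predict.all = TRUE)
ncol(p$individual)   # one prediction per tree, so 100 columns
head(p$aggregate)    # the majority vote across all 100 trees
```

The forest's prediction for each row is simply the majority vote over the columns of p$individual.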

Assuming my Training data set is called TrainDf and my Testing data set is called TestDf, how can I create these 2 plots in R?

UPDATE: From these 2 posts, it seems that these plots cannot be generated, or am I missing something here? Why is Random Forest with a single tree much better than a Decision Tree classifier?

How would you interpret an ensemble tree model?

To plot the variable importance, you can use the code below.

library(randomForest)

# fit a forest on the built-in mtcars data; importance = TRUE stores the
# variable importance measures, and keep.forest = FALSE saves memory since
# only the importance is needed here
mtcars.rf <- randomForest(am ~ ., data = mtcars, ntree = 1000,
                          keep.forest = FALSE, importance = TRUE)
varImpPlot(mtcars.rf)

A feature importance plot with ggplot2:

library(randomForest)
library(ggplot2)
mtcars.rf <- randomForest(vs ~ ., data = mtcars)
# gather the importance measures (IncNodePurity for a regression forest)
# into a data frame that ggplot can use
imp <- cbind.data.frame(Feature = rownames(mtcars.rf$importance), mtcars.rf$importance)
g <- ggplot(imp, aes(x = reorder(Feature, -IncNodePurity), y = IncNodePurity))
g + geom_bar(stat = 'identity') + xlab('Feature')

[image: feature importance bar plot]

A Decision Tree plot with igraph (plotting one tree from the random forest):

# extract the 1st decision tree of the forest with k=1 (the forest must be
# stored in the model, i.e. keep.forest = TRUE, which is the default)
tree <- randomForest::getTree(mtcars.rf, k = 1, labelVar = TRUE)
tree$`split var` <- as.character(tree$`split var`)
tree$`split point` <- as.character(tree$`split point`)
tree$`split var`[is.na(tree$`split var`)] <- ''      # leaf nodes have no split variable
tree$`split point`[tree$`split point` == '0'] <- ''  # leaf nodes have no split point

library(igraph)
# edge list: each internal node points to its left and right daughters
gdf <- data.frame(from = rep(rownames(tree), 2),
                  to = c(tree$`left daughter`, tree$`right daughter`))
g <- graph_from_data_frame(gdf, directed = TRUE)
g <- delete_vertices(g, '0')  # leaf rows produce edges to a dummy '0' vertex
# label each node with its split variable and (split point, prediction);
# labels are assigned after deleting '0' so they align with the tree's rows
V(g)$label <- paste(tree$`split var`, '\n(', tree$`split point`, ',',
                    round(tree$prediction, 2), ')')
print(g, e = TRUE, v = TRUE)
plot(g, layout = layout_as_tree(g, root = 1), vertex.size = 5, vertex.color = 'cyan')

As can be seen from the following plot, the label for each node in the decision tree shows the variable chosen for the split at that node, followed by (the split value, the proportion of observations with label 1) at that node.

[image: decision tree plot for the 1st tree]

Likewise, the 100th tree can be obtained with k=100 in the randomForest::getTree() function; it looks like the following:

[image: decision tree plot for the 100th tree]
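The per-tree plotting steps above can be wrapped in a small helper so that any tree of the forest can be drawn by its index k. This is a sketch: plot_rf_tree is a hypothetical name, and it assumes the randomForest and igraph packages are installed.

```r
library(randomForest)
library(igraph)

# plot the k-th tree of a random forest fitted with keep.forest = TRUE
plot_rf_tree <- function(rf, k = 1) {
  tree <- randomForest::getTree(rf, k = k, labelVar = TRUE)
  tree$`split var` <- as.character(tree$`split var`)
  tree$`split point` <- as.character(tree$`split point`)
  tree$`split var`[is.na(tree$`split var`)] <- ''      # blank out leaf nodes
  tree$`split point`[tree$`split point` == '0'] <- ''
  # edge list: each internal node points to its two daughters
  gdf <- data.frame(from = rep(rownames(tree), 2),
                    to = c(tree$`left daughter`, tree$`right daughter`))
  g <- graph_from_data_frame(gdf, directed = TRUE)
  g <- delete_vertices(g, '0')  # drop the dummy vertex created by leaf rows
  V(g)$label <- paste(tree$`split var`, '\n(', tree$`split point`, ',',
                      round(tree$prediction, 2), ')')
  plot(g, layout = layout_as_tree(g, root = 1),
       vertex.size = 5, vertex.color = 'cyan')
  invisible(g)
}

rf <- randomForest(vs ~ ., data = mtcars)  # keep.forest defaults to TRUE
plot_rf_tree(rf, k = 100)                  # draw the 100th tree
```

The function returns the igraph object invisibly, so the tree's structure can also be inspected programmatically after plotting.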
