
Turning a Random Forest into a Decision Tree - Using randomForest package in R

Posted 2014-04-29 07:11:22. Tags: r, random-forest

Is it possible to generate a decision forest whose trees are exactly the same? Please note that this is an experimental question. As far as I understand, random forests have two parameters that introduce 'randomness' compared to a single decision tree:

1) the number of features randomly sampled at each node of a decision tree, and

2) the number of training examples drawn to create a tree.

Intuitively, if I set these two parameters to their maximum values, I should be avoiding the 'randomness', so each created tree should be exactly the same. Because all the trees would be exactly the same, I should get the same results regardless of the number of trees in the forest or of different runs (i.e. different seed values).

I have tested this idea using the randomForest library in R. I think the two aforementioned parameters correspond to 'mtry' and 'sampsize', respectively. I set both to their maximum values, but unfortunately some randomness remains: the OOB error estimates still vary with the number of trees in the forest.

Would you please help me understand how to remove all the randomness in a random decision forest, preferably using the arguments of the randomForest library in R?

1 Answer

In addition to mtry and sampsize, there's another relevant argument in randomForest(): replace. By default, the data points used to grow each tree are sampled with replacement. If you want all data points to be used in all trees, you need not only to set sampsize to the number of data points, but also to set replace=FALSE.
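The effect of replace can be checked directly. The following is a minimal sketch, not part of the original answer: the toy data, the seed, and the use of keep.inbag=TRUE (which asks randomForest() to record which rows each tree was grown on) are assumptions made for illustration.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only to make the toy data reproducible
x <- matrix(rnorm(100), 20, 5)            # 20 data points, 5 numeric features
y <- factor(rep(c("a", "b"), each = 10))  # two balanced classes

# Draw all rows without replacement for every tree and keep the in-bag record
rf <- randomForest(x, y, ntree = 5,
                   sampsize = nrow(x), replace = FALSE,
                   keep.inbag = TRUE)

# rf$inbag has one row per data point and one column per tree;
# an entry of 1 means that point was used to grow that tree
all_in_bag <- all(rf$inbag == 1)
print(all_in_bag)
```

If every point is in bag for every tree, no point is ever out of bag, so OOB error estimates should not be expected to be informative in this setting.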

Here's a toy example to show that you can get a forest of identical trees:

library(randomForest)
set.seed(17)
x <- matrix(sample(5, 50, replace=TRUE), 10, 5)
y <- factor(sample(2, 10, replace=TRUE))
rf1 <- randomForest(x, y, mtry=ncol(x), sampsize=nrow(x), replace=FALSE, ntree=5)
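One possible way to compare the fitted trees is to extract each one with getTree() and count how many distinct trees there are. This is a sketch added for illustration, not the original answer's own check. It repeats the setup above so that it runs on its own. How the package breaks ties between equally good splits is not something the text above settles, so the count is printed rather than assumed to be 1.

```r
library(randomForest)

# Same toy setup as above
set.seed(17)
x <- matrix(sample(5, 50, replace = TRUE), 10, 5)
y <- factor(sample(2, 10, replace = TRUE))
rf1 <- randomForest(x, y, mtry = ncol(x), sampsize = nrow(x),
                    replace = FALSE, ntree = 5)

# Extract every tree as a matrix (one row per node)
trees <- lapply(seq_len(rf1$ntree), function(k) getTree(rf1, k = k))

# Number of distinct trees in the forest; 1 would mean all trees are identical
n_distinct <- length(unique(trees))
print(n_distinct)
```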