
sklearn RandomForestRegressor discrepancy in the displayed tree values

While using the RandomForestRegressor I noticed something strange. To illustrate the problem, here is a small example. I applied the RandomForestRegressor to a test dataset and plotted the graph of the first tree in the forest. This gives me the following output:

Root_node: 
mse=8.64
samples=2
value=20.4

Left_leaf: 
mse=0
samples=1
value=24

Right_leaf: 
mse=0
samples=1
value=18

First, I expected the root node to have a value of (24+18)/2 = 21, but somehow it is 20.4. However, even if this value is correct, how do I get an mse of 8.64? From my point of view it should be: 1/2[(24-20.4)^2 + (18-20.4)^2] = 9.36 (under the assumption that the root value of 20.4 is correct).

My solution is: 1/2[(24-21)^2 + (18-21)^2] = 9. This is also what I get if I just use the DecisionTreeRegressor.
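The arithmetic for a single two-sample node can be checked directly; this sketch (the sample values 24 and 18 are taken from the tree above) computes the node value and mse exactly the way a squared-error tree node reports them, i.e. mean and mean squared deviation of the targets in the node:

```python
import numpy as np

# A two-sample node with targets 24 and 18, as in the
# DecisionTreeRegressor case (no bootstrap sampling).
node = np.array([24.0, 18.0])

value = node.mean()                  # value reported for the node
mse = ((node - value) ** 2).mean()   # impurity (mse) reported for the node

print(value)  # 21.0
print(mse)    # 9.0
```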

Is there something wrong in the implementation of the RandomForestRegressor, or am I completely wrong?

Here is my reproducible code:

import pandas as pd
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
import graphviz

# create example dataset
data = {'AGE': [91, 42, 29, 94, 85], 'TAX': [384, 223, 280, 666, 384], 'Y': [19, 21, 24, 13, 18]}
df = pd.DataFrame(data=data)
x = df[['AGE','TAX']]
y = df[['Y']]

rf_reg = RandomForestRegressor(max_depth=2, random_state=1)
rf_reg.fit(x, y.values.ravel())  # ravel avoids a DataConversionWarning for the column-vector y

# plot a single tree of forest
dot_data = tree.export_graphviz(rf_reg.estimators_[0], out_file=None, feature_names=x.columns)
graph = graphviz.Source(dot_data)
graph

And the output graph:

(image: Graphviz rendering of the first tree of the forest)

tl;dr

It is due to bootstrap sampling.

In detail:

With the default setting bootstrap=True, RF will use bootstrap sampling when building the individual trees; quoting from the Cross Validated thread Number of Samples per-Tree in a Random Forest:

If bootstrap=True, then for each tree, N samples are drawn randomly with replacement from the training set and the tree is built on this new version of the training data. This introduces randomness in the training procedure since trees will each be trained on slightly different training sets. In expectation, drawing N samples with replacement from a dataset of size N will select ~2/3 unique samples from the original set.

"With replacement" means that some samples may be chosen more than once, while others will be left out, with the total number of chosen samples remaining equal to the number of samples of the original dataset (here 5).
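The ~2/3 figure from the quote can be verified with a quick simulation; this sketch (the set size N and the repetition count are arbitrary choices) draws N indices with replacement and measures how many unique samples survive, which converges to 1 - 1/e ≈ 0.632:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000  # size of a hypothetical training set

# Draw N indices with replacement, as bootstrap sampling does,
# and record the fraction of unique samples in each draw.
fractions = [
    np.unique(rng.integers(0, N, size=N)).size / N
    for _ in range(200)
]

print(np.mean(fractions))  # close to 1 - 1/e ≈ 0.632
```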

What actually has happened in the tree you show is that, despite Graphviz displaying samples=2, this should be understood as the number of unique samples; there are in total 5 (bootstrap) samples in the root node: 2 copies of the sample with y=24 and 3 copies of the one with y=18 (recall that, by the definition of the bootstrap sampling procedure, the root node here must contain 5 samples, neither more nor less).
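If you want to see the bootstrap composition for yourself rather than infer it, the indices each tree was trained on can be regenerated. Note that `_generate_sample_indices` is a private sklearn helper (its location and signature may change between versions), so this is an inspection sketch, not a stable API:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# Private helper used internally by sklearn to draw the bootstrap
# indices for each tree; used here only for inspection.
from sklearn.ensemble._forest import _generate_sample_indices

data = {'AGE': [91, 42, 29, 94, 85], 'TAX': [384, 223, 280, 666, 384],
        'Y': [19, 21, 24, 13, 18]}
df = pd.DataFrame(data=data)
x, y = df[['AGE', 'TAX']], df['Y']

rf_reg = RandomForestRegressor(max_depth=2, random_state=1).fit(x, y)

tree0 = rf_reg.estimators_[0]
# Regenerate the bootstrap indices of the first tree (5 draws from 5 rows).
idx = _generate_sample_indices(tree0.random_state, len(df), len(df))

print(sorted(y.iloc[idx]))         # the 5 bootstrap targets, duplicates included
print(tree0.tree_.value[0][0][0])  # root value = mean of those 5 targets
```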

Now the displayed values add up:

# value:
(2*24 + 3*18)/5
# 20.4

# mse:
(2*(24-20.4)**2 + 3*(18-20.4)**2)/5
# 8.64

There obviously seems to be some design choice, either in the Graphviz visualization or in the underlying DecisionTreeRegressor, so that only the number of unique samples is stored/displayed. This may (or may not) be a reason for opening a GitHub issue, but this is how the situation is for now (to be honest, I am not sure myself that I would want the actual total number of samples displayed here, including the duplicates due to bootstrap sampling).
