
sklearn RandomForestRegressor discrepancy in the displayed tree values

While using the RandomForestRegressor I noticed something strange. To illustrate the problem, here is a small example. I applied the RandomForestRegressor to a test dataset and plotted the graph of the first tree in the forest. This gives me the following output:

Root_node: 
mse=8.64
samples=2
value=20.4

Left_leaf: 
mse=0
samples=1
value=24

Right_leaf: 
mse=0
samples=1
value=18

First, I expected the root node to have a value of (24+18)/2 = 21, but somehow it is 20.4. However, even if this value is correct, how do I get an mse of 8.64? From my point of view it should be: 1/2[(24-20.4)^2 + (18-20.4)^2] = 9.36 (under the assumption that the root value of 20.4 is correct).

My solution is: 1/2[(24-21)^2 + (18-21)^2] = 9. This is also what I get if I just use the DecisionTreeRegressor.
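The arithmetic for a single two-sample node can be checked directly; this sketch (the sample values 24 and 18 are taken from the tree above) computes the node value and mse exactly the way a squared-error tree node reports them, i.e. mean and mean squared deviation of the targets in the node:

```python
import numpy as np

# A two-sample node with targets 24 and 18, as in the
# DecisionTreeRegressor case (no bootstrap sampling).
node = np.array([24.0, 18.0])

value = node.mean()                  # value reported for the node
mse = ((node - value) ** 2).mean()   # impurity (mse) reported for the node

print(value)  # 21.0
print(mse)    # 9.0
```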

Is there something wrong in the implementation of the RandomForestRegressor, or am I completely wrong?

Here is my reproducible code:

import pandas as pd
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
import graphviz

# create example dataset
data = {'AGE': [91, 42, 29, 94, 85], 'TAX': [384, 223, 280, 666, 384], 'Y': [19, 21, 24, 13, 18]}
df = pd.DataFrame(data=data)
x = df[['AGE','TAX']]
y = df[['Y']]

rf_reg = RandomForestRegressor(max_depth=2, random_state=1)
rf_reg.fit(x, y.values.ravel())  # ravel avoids a DataConversionWarning for the column-vector y

# plot a single tree of forest
dot_data = tree.export_graphviz(rf_reg.estimators_[0], out_file=None, feature_names=x.columns)
graph = graphviz.Source(dot_data)
graph

And the output graph:

(image: Graphviz rendering of the first tree of the forest)

tl;dr

It is due to bootstrap sampling.

In detail:

With the default setting bootstrap=True, RF will use bootstrap sampling when building the individual trees; quoting from the Cross Validated thread Number of Samples per-Tree in a Random Forest:

If bootstrap=True, then for each tree, N samples are drawn randomly with replacement from the training set and the tree is built on this new version of the training data. This introduces randomness in the training procedure since trees will each be trained on slightly different training sets. In expectation, drawing N samples with replacement from a dataset of size N will select ~2/3 unique samples from the original set.

"With replacement" means that some samples may be chosen more than once, while others will be left out, with the total number of chosen samples remaining equal to the number of samples of the original dataset (here 5).
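The ~2/3 figure from the quote can be verified with a quick simulation; this sketch (the set size N and the repetition count are arbitrary choices) draws N indices with replacement and measures how many unique samples survive, which converges to 1 - 1/e ≈ 0.632:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000  # size of a hypothetical training set

# Draw N indices with replacement, as bootstrap sampling does,
# and record the fraction of unique samples in each draw.
fractions = [
    np.unique(rng.integers(0, N, size=N)).size / N
    for _ in range(200)
]

print(np.mean(fractions))  # close to 1 - 1/e ≈ 0.632
```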

What actually has happened in the tree you show is that, despite Graphviz displaying samples=2, this should be understood as the number of unique samples; there are in total 5 (bootstrap) samples in the root node: 2 copies of the sample with y=24 and 3 copies of the one with y=18 (recall that, by the definition of the bootstrap sampling procedure, the root node here must contain 5 samples, neither more nor less).
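If you want to see the bootstrap composition for yourself rather than infer it, the indices each tree was trained on can be regenerated. Note that `_generate_sample_indices` is a private sklearn helper (its location and signature may change between versions), so this is an inspection sketch, not a stable API:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# Private helper used internally by sklearn to draw the bootstrap
# indices for each tree; used here only for inspection.
from sklearn.ensemble._forest import _generate_sample_indices

data = {'AGE': [91, 42, 29, 94, 85], 'TAX': [384, 223, 280, 666, 384],
        'Y': [19, 21, 24, 13, 18]}
df = pd.DataFrame(data=data)
x, y = df[['AGE', 'TAX']], df['Y']

rf_reg = RandomForestRegressor(max_depth=2, random_state=1).fit(x, y)

tree0 = rf_reg.estimators_[0]
# Regenerate the bootstrap indices of the first tree (5 draws from 5 rows).
idx = _generate_sample_indices(tree0.random_state, len(df), len(df))

print(sorted(y.iloc[idx]))         # the 5 bootstrap targets, duplicates included
print(tree0.tree_.value[0][0][0])  # root value = mean of those 5 targets
```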

Now the displayed values add up:

# value:
(2*24 + 3*18)/5
# 20.4

# mse:
(2*(24-20.4)**2 + 3*(18-20.4)**2)/5
# 8.64

There obviously seems to be some design choice, either in the Graphviz visualization or in the underlying DecisionTreeRegressor, so that only the number of unique samples is stored/displayed. This may (or may not) be a reason for opening a GitHub issue, but this is how the situation is for now (to be honest, I am not sure myself that I would want the actual total number of samples displayed here, including the duplicates due to bootstrap sampling).
