
How can the SciKit-Learn Random Forest sub-sample size be equal to the original training data size?

In the documentation of the SciKit-Learn Random Forest classifier, it is stated that:

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

What I don't understand is this: if the sample size is always the same as the input sample size, then how can we talk about a random selection? There is no selection here, because we use all of the (and naturally the same) samples at each training.

Am I missing something here?

I believe this part of the docs answers your question:

In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

The key to understanding is in "sample drawn with replacement". This means that each instance can be drawn more than once, which in turn means that some instances in the training set are present several times while others are not present at all (out-of-bag). Those differ from tree to tree.
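To make this concrete, here is a minimal sketch (using numpy on a toy index set, not code from the question) of how a bootstrap sample of the same size as the training set still leaves some instances out-of-bag:

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0)

import numpy as np

rng = np.random.default_rng(0)
train_indices = np.arange(10)  # a toy training set of 10 instances

# Draw a bootstrap sample: same size as the original, with replacement.
bootstrap = rng.choice(train_indices, size=train_indices.size, replace=True)

in_bag = np.unique(bootstrap)                     # instances the tree sees
out_of_bag = np.setdiff1d(train_indices, in_bag)  # instances it never sees

print("bootstrap sample:", bootstrap)
print("out-of-bag:", out_of_bag)

Because duplicates are allowed, the sample has the full original size, yet the out-of-bag set is typically non-empty.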

Certainly not all samples are selected for each tree. By default each sample has a 1-((N-1)/N)^N ≈ 0.632 chance of being drawn at least once for a particular tree, where N is the sample size of the training set; for large N, the number of times a given sample appears in one bootstrap is approximately Poisson(1) distributed.
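A quick simulation (a sketch added here, not part of the original answer) confirms the ≈ 0.632 figure:

import numpy as np

rng = np.random.default_rng(0)
N = 1_000        # training-set size
trials = 1_000   # number of simulated bootstrap draws

hits = 0
for _ in range(trials):
    sample = rng.integers(0, N, size=N)  # one bootstrap draw with replacement
    hits += np.unique(sample).size       # instances drawn at least once

print(hits / (trials * N))  # ~0.632, matching 1 - ((N-1)/N)**N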

On average, each bootstrap sample differs enough from the other bootstraps that the decision trees are adequately different, so the averaged prediction of the trees is robust to the variance of each individual tree model. If the sample size were increased to 5 times the training set size, every observation would probably be present 3-7 times in each tree, and the overall ensemble prediction performance would suffer.

The answer from @communitywiki misses the question: "What I don't understand is that if the sample size is always the same as the input sample size then how can we talk about a random selection". It has to do with the nature of bootstrapping itself. Bootstrapping can repeat the same values several times yet keep the sample size the same as the original data. Example (courtesy of the Wikipedia page on Bootstrapping/Approach):

  • Original sample: [1,2,3,4,5]
  • Bootstrap 1: [1,2,4,4,1]
  • Bootstrap 2: [1,1,3,3,5]

    and so on.

This is how random selection can occur while the sample size remains the same.

Although I am pretty new to Python, I had a similar problem.

I tried to fit a RandomForestClassifier to my data. I split the data into train and test sets:

from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0)
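The answer trails off here; for context, a minimal sketch of the fitting step it describes (X and Y stand for the poster's features and labels, which are not shown):

from sklearn.ensemble import RandomForestClassifier

# bootstrap=True (the default) means each tree is trained on a
# with-replacement sample the same size as train_x.
clf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
clf.fit(train_x, train_y)
print(clf.score(test_x, test_y))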
