
How can the SciKit-Learn Random Forest sub-sample size be equal to the original training data size?

In the documentation of the SciKit-Learn Random Forest classifier, it is stated that:

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

What I don't understand is this: if the sample size is always the same as the input sample size, then how can we talk about a random selection? There is no selection here, because we use all of the samples (and naturally the same ones) for each tree.

Am I missing something here?

I believe this part of the docs answers your question:

In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

The key to understanding this is in "sample drawn with replacement". It means that each instance can be drawn more than once. This in turn means that some instances of the training set are present several times in a given bootstrap sample and some are not present at all (out-of-bag). Which instances are repeated and which are left out differs from tree to tree.
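To make this concrete, here is a minimal sketch in plain Python of drawing one bootstrap sample per tree. It is not scikit-learn's internal code; the helper names `bootstrap_indices` and `summarize` and the training set size of 1000 are made up for illustration.

```python
import random


def bootstrap_indices(n, seed=None):
    """Draw n row indices from range(n) uniformly at random, with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]


def summarize(n, seed):
    """Return (sample size, number of distinct rows, number of out-of-bag rows)."""
    idx = bootstrap_indices(n, seed)
    in_bag = set(idx)                    # rows that appear at least once
    out_of_bag = set(range(n)) - in_bag  # rows that were never drawn
    return len(idx), len(in_bag), len(out_of_bag)


# Hypothetical training set of 1000 rows, one bootstrap sample per "tree".
for tree_seed in range(3):
    size, n_unique, n_oob = summarize(1000, tree_seed)
    print(f"tree {tree_seed}: sample size={size}, "
          f"distinct rows={n_unique}, out-of-bag={n_oob}")
```

Every sample has exactly as many entries as the training set, yet the set of distinct rows (and therefore the out-of-bag set) changes with each draw.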

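Certainly not all samples are selected for each tree. By default, each sample has a 1-((N-1)/N)^N ≈ 0.63 chance of being drawn at least once for one particular tree, where N is the sample size of the training set. The number of times a given sample appears follows a Binomial(N, 1/N) distribution, so for large N it is left out with probability about 0.37, drawn exactly once with probability about 0.37, exactly twice with probability about 0.18, exactly three times with probability about 0.06, and so on.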
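These probabilities can be checked directly. The sketch below assumes only that each of the N draws picks one of the N rows uniformly at random, so the number of copies of a given row is Binomial(N, 1/N); the function names and N = 1000 are illustrative.

```python
from math import comb


def p_in_bag(n):
    """Probability that a given row is drawn at least once in n draws."""
    return 1 - ((n - 1) / n) ** n


def p_drawn_exactly(k, n):
    """Probability that a given row is drawn exactly k times in n draws."""
    p = 1 / n  # chance of picking this particular row on any single draw
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n = 1000  # hypothetical training set size
print(f"P(in bag)        = {p_in_bag(n):.3f}")
for k in range(4):
    print(f"P(exactly {k} copies) = {p_drawn_exactly(k, n):.3f}")
```

For large N the counts approach a Poisson distribution with mean 1, which is why the in-bag probability settles near 1 - 1/e.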

On average, each bootstrap sample differs enough from the others that the decision trees are adequately different, which makes the averaged prediction of the trees robust to the variance of each individual tree model. If the sample size could be increased to 5 times the training set size, every observation would probably be present 3-7 times in each tree, and the overall ensemble prediction performance would suffer.
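The effect of a larger sample can be illustrated with a small simulation. This is a sketch only: it does not touch scikit-learn, and `copy_counts`, `fraction`, and the size of 10,000 rows are assumptions chosen for the example.

```python
import random
from collections import Counter


def copy_counts(n, factor, seed=0):
    """Draw factor * n rows with replacement; return how often each row appears."""
    rng = random.Random(seed)
    draws = Counter(rng.randrange(n) for _ in range(factor * n))
    return [draws.get(i, 0) for i in range(n)]


def fraction(counts, lo, hi):
    """Share of rows whose number of copies lies in [lo, hi]."""
    return sum(lo <= c <= hi for c in counts) / len(counts)


n = 10_000  # hypothetical training set size
for factor in (1, 5):
    counts = copy_counts(n, factor)
    print(f"factor={factor}: left out={fraction(counts, 0, 0):.3f}, "
          f"3 to 7 copies={fraction(counts, 3, 7):.3f}")
```

With a factor of 1, a bit more than a third of the rows are left out of each sample; with a factor of 5, almost every row is included and most appear 3 to 7 times, so the samples (and the trees grown on them) become much more alike.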

The answer from @communitywiki does not address the actual question: "What I don't understand is that if the sample size is always the same as the input sample size, then how can we talk about a random selection". The explanation lies in the nature of bootstrapping itself. Bootstrapping repeats some values several times while keeping the same sample size as the original data. Example (courtesy of the Wikipedia page on Bootstrapping, "Approach" section):

  • Original sample: [1,2,3,4,5]
  • Bootstrap 1: [1,2,4,4,1]
  • Bootstrap 2: [1,1,3,3,5]

    and so on.

This is how random selection can occur while the sample size stays the same.
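Resamples like the ones in the example can be generated with Python's standard library. This is a minimal sketch; the `bootstrap` helper and the seed value are arbitrary choices, and the exact lists printed depend on the seed.

```python
import random

original = [1, 2, 3, 4, 5]
rng = random.Random(42)  # fixed seed only so that reruns are repeatable


def bootstrap(sample, rng):
    """Return a resample of the same length, drawn with replacement."""
    return rng.choices(sample, k=len(sample))


resamples = [bootstrap(original, rng) for _ in range(3)]
for i, resample in enumerate(resamples, start=1):
    print(f"Bootstrap {i}: {resample}")
```

Each resample has five elements taken from the original five values, but values may repeat and others may be missing.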

Although I am pretty new to Python, I had a similar problem.

I split the data into train and test sets:

from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0)  # hold out 20% of the rows for testing
