
Size of sample in Random Forest Regression

If I understand correctly, when Random Forest estimators are computed, bootstrapping is usually applied, which means that tree(i) is built using only the data from sample(i), drawn with replacement. I want to know the size of the sample that sklearn's RandomForestRegressor uses.

The only thing that I see that is close:

bootstrap : boolean, optional (default=True)
    Whether bootstrap samples are used when building trees.

But there is no way to specify the size or proportion of the sample, nor does it tell me the default sample size.

I feel like there should be a way to at least know what the default sample size is. What am I missing?

I agree that it is quite strange that we cannot specify the subsample/bootstrap size in RandomForestRegressor. A potential workaround is to use BaggingRegressor instead: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor

RandomForestRegressor is just a special case of BaggingRegressor (use bootstraps to reduce the variance of a set of low-bias, high-variance estimators). In RandomForestRegressor, the base estimator is forced to be DecisionTree, whereas in BaggingRegressor you are free to choose the base_estimator. More importantly, you can set a custom subsample size: for example, max_samples=0.5 will draw random subsamples with size equal to half of the entire training set. You can also use just a subset of the features by setting max_features and bootstrap_features.
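A minimal sketch of the BaggingRegressor workaround, assuming a synthetic dataset from make_regression and arbitrary illustrative values (200 rows, 20 estimators, a 0.75 feature fraction); none of these numbers come from the answer itself. The base estimator is left at its default, which is a decision tree regressor, because the keyword for passing a custom one has been renamed between scikit-learn releases.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

bagging = BaggingRegressor(
    n_estimators=20,
    max_samples=0.5,           # each tree is fit on a draw of half the rows
    max_features=0.75,         # each tree sees 75% of the columns
    bootstrap=True,            # rows are drawn with replacement
    bootstrap_features=False,  # columns are drawn without replacement
    random_state=0,
)
bagging.fit(X, y)

# estimators_samples_ holds the row indices drawn for each fitted estimator.
sizes = {len(idx) for idx in bagging.estimators_samples_}
print(sizes)
```

Inspecting estimators_samples_ is a quick way to check that the requested subsample size was actually applied.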

The sample size for bootstrap is always the number of samples.

You are not missing anything; the same question was asked on the mailing list for RandomForestClassifier:

The bootstrap sample size is always the same as the input sample size. If you feel up to it, a pull request updating the documentation would probably be quite welcome.
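To make that concrete, here is a small sketch in plain Python of what a bootstrap sample of size n looks like. It is an illustration of the idea only, not scikit-learn's internal code; the helper name bootstrap_indices, the seed, and the value of n_samples are all made up for the example.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(0)  # arbitrary fixed seed for reproducibility
n_samples = 10_000
indices = bootstrap_indices(n_samples, rng)

# The bootstrap sample has as many rows as the input...
sample_size = len(indices)
# ...but because rows repeat, only about 1 - 1/e (roughly 63%) are distinct.
unique_fraction = len(set(indices)) / n_samples
print(sample_size, round(unique_fraction, 3))
```

So even though the sample size equals the number of input rows, each tree still leaves out roughly a third of the distinct rows.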

In version 0.22 of scikit-learn, the max_samples option was added, which does what you are asking for (see the documentation for the class).
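With scikit-learn 0.22 or later, the subsample size can be set directly on the forest. The sketch below assumes a synthetic dataset and arbitrary settings (200 rows, 20 trees) chosen only for illustration. max_samples applies only when bootstrap=True, and leaving it at None keeps the default behaviour of drawing as many rows as the training set contains.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

forest = RandomForestRegressor(
    n_estimators=20,
    bootstrap=True,   # max_samples is only used with bootstrapping enabled
    max_samples=0.5,  # a float is a fraction of the rows; an int is a row count
    random_state=0,
)
forest.fit(X, y)

n_trees = len(forest.estimators_)
predictions = forest.predict(X[:3])
print(n_trees, predictions.shape)
```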


 