
Averaging Multiple Random Forest Models in scikit-learn

I have an extremely large dataset and would like to train several random forest models on partitions of the dataset, then average these models to come up with my final classifier. Since random forest is an ensemble method, this is an intuitively sound approach, but I'm unsure whether it's possible to do using scikit-learn's random forest classifier. Any ideas?
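One way people do this in practice (not an official scikit-learn API, so treat it as a sketch) is to fit a forest per partition and then pool the fitted trees via the internal `estimators_` attribute. This relies on implementation details and assumes every partition contains all classes, so that `classes_` is identical across the forests:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for a large dataset, split into two partitions.
X, y = make_classification(n_samples=1000, random_state=0)
halves = np.array_split(np.arange(len(X)), 2)

forests = []
for idx in halves:
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit(X[idx], y[idx])
    forests.append(rf)

# Pool the fitted trees into a single forest. This only makes sense if
# all forests saw the same classes, so their classes_ arrays line up.
combined = forests[0]
for other in forests[1:]:
    combined.estimators_ += other.estimators_
combined.n_estimators = len(combined.estimators_)

print(combined.n_estimators)   # 100 trees total
print(combined.predict(X[:5]))
```

Prediction then averages over all 100 pooled trees, which is exactly the "average the models" behavior the question asks for, at the cost of depending on a private attribute that could change between scikit-learn versions.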

I'd also be open to using a random forest classifier from another package; I'm just not sure where to look.

Here is what I can think of:

  1. Pandas + Scikit: You can implement your own bootstrap algorithm, where you randomly read a reasonably sized sample from the overall dataset and fit scikit-learn trees on it (ideally randomizing the candidate features at each node). Then pickle each tree and finally average them out to come up with your random forest.
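A minimal sketch of this option: draw bootstrap samples, fit individual trees with `max_features="sqrt"` (which gives the per-node feature randomization a random forest uses), pickle each one, and "average" them at prediction time by soft-voting over their class probabilities. The in-memory data and pickled byte strings here stand in for reading chunks from disk and writing tree files:

```python
import pickle
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
rng = np.random.default_rng(0)

tree_blobs = []
for _ in range(10):
    # Bootstrap: sample rows with replacement from the (large) dataset.
    idx = rng.integers(0, len(X), size=500)
    # max_features="sqrt" randomizes the features considered at each split,
    # which is what makes a bag of trees behave like a random forest.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    tree_blobs.append(pickle.dumps(tree))  # stand-in for pickling to disk

# "Average" the forest: soft-vote over each tree's class probabilities.
trees = [pickle.loads(b) for b in tree_blobs]
proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
pred = proba.argmax(axis=1)
```

Because each tree only ever sees one sample, the full dataset never needs to be in memory at training time; only the prediction step touches whatever batch you are scoring.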

  2. Graphlab + SFrame: Turi has its own big-data library (SFrame, similar to Pandas) and machine learning library (graphlab, very similar to scikit). A very nice environment.

  3. Blaze-Dask might have a little steeper learning curve for some people, but would be an efficient solution.

  4. You can also go with memory-mapped numpy arrays, but that is more cumbersome than the first three options, and I've never done it myself, so I'll just leave this option here.
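For the memory-mapped option, the idea is that `np.memmap` pages rows in from disk only as they are touched, so the full feature matrix never has to fit in RAM. A rough sketch, where `features.dat` is a hypothetical flat float32 file standing in for your real on-disk data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical on-disk feature matrix, written here as a flat float32 file.
rng = np.random.default_rng(0)
X = rng.random((1000, 20), dtype=np.float32)
X.tofile("features.dat")
y = (X[:, 0] > 0.5).astype(int)

# Memory-map the file: rows are read from disk only when accessed,
# so you can slice out one partition at a time to train on.
Xmm = np.memmap("features.dat", dtype=np.float32, mode="r",
                shape=(1000, 20))

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(Xmm[:500], y[:500])          # train on one partition of the map
score = rf.score(Xmm[500:], y[500:])  # evaluate on another slice
```

Note that slicing a memmap for `fit` still materializes that slice in memory, so the partitions themselves must be small enough to hold; the win is only that the *whole* array never is.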

All in all, I would go with option 2.

