Scikit Learn Random forest classifier: How to produce a plot of OOB error against number of trees
Edit 2: There is now a lovely example in the sklearn documentation on this.
In order to see how many trees are necessary in my forest, I'd like to plot the OOB error as the number of trees used in the forest is increased. I'm in Python using a sklearn.ensemble.RandomForestClassifier, but I can't find how to predict using a subset of the trees in the forest. I could do this by making a new random forest on each iteration with an increasing number of trees, but that is too expensive.
It seems a similar task is possible with the Gradient Boosting object using the staged_decision_function method. See this example.
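For reference, a minimal sketch of that staged approach with GradientBoostingClassifier: staged_predict yields predictions after each additional stage, so a single fit gives the whole error-versus-number-of-trees curve. The dataset here is synthetic and only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for the real problem
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# One prediction pass per stage, without refitting anything
errors = [1.0 - np.mean(y_pred == y_test)
          for y_pred in clf.staged_predict(X_test)]
```

The list errors then has one test-error value per boosting stage and can be plotted directly against range(1, 51).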
This is quite a simple procedure in R and can be achieved by simply calling plot(randomForestObject):
-- Edit -- I see now the RandomForestClassifier object has an attribute estimators_ which returns all the DecisionTreeClassifier objects in a list. So to solve this I can iterate through that list, predicting the results from each tree and taking a 'cumulative average'. However, is there really no easier way to do this, already implemented?
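A sketch of that cumulative-average idea: reuse the already-fitted trees in estimators_ and average their class probabilities one tree at a time. Note this scores on a held-out set, not a true OOB estimate (the answer below explains why the training set alone is not enough). The data is a synthetic placeholder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=30, random_state=0)
forest.fit(X_train, y_train)

proba_sum = np.zeros((len(X_test), forest.n_classes_))
errors = []
for i, tree in enumerate(forest.estimators_, start=1):
    # Each element of estimators_ is a fitted DecisionTreeClassifier
    proba_sum += tree.predict_proba(X_test)
    y_pred = forest.classes_[np.argmax(proba_sum / i, axis=1)]
    errors.append(1.0 - np.mean(y_pred == y_test))
```

errors[i-1] is the test error of the sub-forest made of the first i trees, so the full curve comes from one fit plus one prediction pass per tree.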
There is a discussion and code in this issue: https://github.com/scikit-learn/scikit-learn/issues/4273
You can add trees one by one like this:
from sklearn.ensemble import RandomForestClassifier

n_estimators = 100
forest = RandomForestClassifier(warm_start=True, oob_score=True)
for i in range(1, n_estimators + 1):
    forest.set_params(n_estimators=i)
    forest.fit(X, y)  # with warm_start=True, only the new tree is fitted
    print(i, forest.oob_score_)
The solution you propose would also need to get the OOB indices for each tree, because you don't want to compute the score on all the training data.
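To make that point concrete, here is a hand-rolled sketch of what tracking per-tree OOB indices involves, built from individual DecisionTreeClassifier objects rather than sklearn's internals (which are private and version-dependent): each tree is fitted on a bootstrap sample we draw ourselves, and only the rows that tree never saw contribute to its votes. The data and class count (binary) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, random_state=0)
n_samples, n_trees = len(X), 25

votes = np.zeros((n_samples, 2))  # running OOB class votes per sample
oob_errors = []
for _ in range(n_trees):
    boot = rng.randint(0, n_samples, n_samples)      # bootstrap sample
    oob = np.setdiff1d(np.arange(n_samples), boot)   # rows this tree never saw
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=rng).fit(X[boot], y[boot])
    # Each OOB row gets one vote for this tree's predicted class
    votes[oob, tree.predict(X[oob]).astype(int)] += 1
    seen = votes.sum(axis=1) > 0     # rows that were OOB for at least one tree
    y_hat = votes.argmax(axis=1)
    oob_errors.append(1.0 - np.mean(y_hat[seen] == y[seen]))
```

This mirrors what oob_score_ computes internally; the warm_start loop above gets you the same curve without re-implementing the bookkeeping.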
I still feel this is a strange thing to do, as there is really no natural ordering of the trees in the forest. Can you explain what your use case is? Do you want to find the minimum number of trees for a given accuracy, to reduce prediction time? If you want fast prediction, I'd suggest using GradientBoostingClassifier, which is usually much faster.