Scikit Learn Random forest classifier: How to produce a plot of OOB error against number of trees
Edit 2: There is now a lovely example in the sklearn documentation on this.
In order to see how many trees are necessary in my forest, I'd like to plot the OOB error as the number of trees used in the forest is increased. I'm in Python using a sklearn.ensemble.RandomForestClassifier, but I can't find how to predict using a subset of the trees in the forest. I could do this by making a new random forest on each iteration with an increasing number of trees, but that is too expensive.
It seems a similar task is possible with the Gradient Boosting object using the staged_decision_function method. See this example.
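For reference, a minimal sketch of that staged approach with GradientBoostingClassifier: staged_predict yields predictions after each additional stage, so a single fit gives the whole error-versus-number-of-trees curve. The dataset here is synthetic and only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for the real problem
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# One prediction pass per stage, without refitting anything
errors = [1.0 - np.mean(y_pred == y_test)
          for y_pred in clf.staged_predict(X_test)]
```

The list errors then has one test-error value per boosting stage and can be plotted directly against range(1, 51).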
This is quite a simple procedure in R and can be achieved by simply calling plot(randomForestObject):
-- Edit -- I see now the RandomForestClassifier object has an attribute estimators_ which returns all the DecisionTreeClassifier objects in a list. So to solve this I can iterate through that list, predicting the results from each tree and taking a 'cumulative average'. However, is there really no easier way to do this, already implemented?
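A sketch of that cumulative-average idea: reuse the already-fitted trees in estimators_ and average their class probabilities one tree at a time. Note this scores on a held-out set, not a true OOB estimate (the answer below explains why the training set alone is not enough). The data is a synthetic placeholder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=30, random_state=0)
forest.fit(X_train, y_train)

proba_sum = np.zeros((len(X_test), forest.n_classes_))
errors = []
for i, tree in enumerate(forest.estimators_, start=1):
    # Each element of estimators_ is a fitted DecisionTreeClassifier
    proba_sum += tree.predict_proba(X_test)
    y_pred = forest.classes_[np.argmax(proba_sum / i, axis=1)]
    errors.append(1.0 - np.mean(y_pred == y_test))
```

errors[i-1] is the test error of the sub-forest made of the first i trees, so the full curve comes from one fit plus one prediction pass per tree.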
There is a discussion and code in this issue: https://github.com/scikit-learn/scikit-learn/issues/4273
You can add trees one by one like this:
from sklearn.ensemble import RandomForestClassifier

n_estimators = 100
forest = RandomForestClassifier(warm_start=True, oob_score=True)
for i in range(1, n_estimators + 1):
    forest.set_params(n_estimators=i)
    forest.fit(X, y)  # with warm_start=True, only the new tree is fitted
    print(i, forest.oob_score_)
The solution you propose would also need to get the OOB indices for each tree, because you don't want to compute the score on all the training data.
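To make that point concrete, here is a hand-rolled sketch of what tracking per-tree OOB indices involves, built from individual DecisionTreeClassifier objects rather than sklearn's internals (which are private and version-dependent): each tree is fitted on a bootstrap sample we draw ourselves, and only the rows that tree never saw contribute to its votes. The data and class count (binary) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, random_state=0)
n_samples, n_trees = len(X), 25

votes = np.zeros((n_samples, 2))  # running OOB class votes per sample
oob_errors = []
for _ in range(n_trees):
    boot = rng.randint(0, n_samples, n_samples)      # bootstrap sample
    oob = np.setdiff1d(np.arange(n_samples), boot)   # rows this tree never saw
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=rng).fit(X[boot], y[boot])
    # Each OOB row gets one vote for this tree's predicted class
    votes[oob, tree.predict(X[oob]).astype(int)] += 1
    seen = votes.sum(axis=1) > 0     # rows that were OOB for at least one tree
    y_hat = votes.argmax(axis=1)
    oob_errors.append(1.0 - np.mean(y_hat[seen] == y[seen]))
```

This mirrors what oob_score_ computes internally; the warm_start loop above gets you the same curve without re-implementing the bookkeeping.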
I still feel this is a strange thing to do, as there is really no natural ordering of the trees in the forest. Can you explain what your use case is? Do you want to find the minimum number of trees for a given accuracy, to reduce prediction time? If you want fast prediction, I'd suggest using GradientBoostingClassifier, which is usually much faster.