
How is the impurity decrease of a split computed when there are multiple outputs with Scikit-Learn's RandomForestRegressor?

I am using the RandomForestRegressor class of the scikit-learn library (Python 3.x), and I am aware that the function used to measure the quality of a split in a decision tree is the variance reduction (mse). Given that the RandomForestRegressor class supports multiple outputs, my question is: how is the quality of a split computed in the case of multiple outputs in this particular class?
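For context, here is a minimal sketch of the multi-output usage in question: a single forest is fit on a target array y of shape (n_samples, n_outputs). The data and parameter values are arbitrary placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 4)  # 100 samples, 4 features

# Two output columns stacked into y of shape (100, 2).
y = np.column_stack([
    X[:, 0] + 0.1 * rng.randn(100),
    X[:, 1] - X[:, 2] + 0.1 * rng.randn(100),
])

# The default squared-error criterion (historically named "mse")
# is the variance reduction discussed above.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)                    # one model for both outputs

print(model.predict(X[:3]).shape)  # -> (3, 2): one prediction per output
```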

From reading the source code of the class defining the split criterion, I would say that the impurity decrease of a split in a tree is computed as the average impurity decrease over all output variables. Hence, only one model is built for multiple outputs. Is that the default behaviour of the scikit-learn RandomForestRegressor class? I was hoping someone could take a look with me, for I am not completely sure whether my statements are correct!

Many thanks in advance!

https://github.com/scikit-learn/scikit-learn/blob/a24c8b464d094d2c468a16ea9f8bf8d42d949f84/sklearn/tree/_criterion.pyx#L695

One of the authors of the corresponding scikit-learn class (Gilles Louppe) was kind enough to answer my question: the above understanding is correct. The reduction of variance is computed for each output and then averaged to produce the final score.
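For illustration, here is a small NumPy sketch of that rule. This is not the actual Cython code from sklearn/tree/_criterion.pyx, just the averaged-variance computation it describes; the function names and the example data are made up.

```python
import numpy as np

def node_impurity(y):
    # y: array of shape (n_samples, n_outputs) reaching a node.
    # MSE impurity: the variance of each output column, averaged over outputs.
    return np.mean(np.var(y, axis=0))

def impurity_decrease(y, left_mask):
    # Variance reduction of splitting y into left/right children by a
    # boolean mask: parent impurity minus the weighted child impurities.
    n = len(y)
    y_left, y_right = y[left_mask], y[~left_mask]
    return (node_impurity(y)
            - len(y_left) / n * node_impurity(y_left)
            - len(y_right) / n * node_impurity(y_right))

# Example: 6 samples, 2 outputs; split the first three from the last three.
y = np.array([[1.0, 10.0], [1.1, 10.5], [0.9, 9.5],
              [5.0, -2.0], [5.2, -2.5], [4.8, -1.5]])
mask = np.array([True, True, True, False, False, False])
print(impurity_decrease(y, mask))  # large decrease: outputs separate cleanly
```

Because the per-output variance reductions are averaged into a single score, every split (and hence the whole tree) is shared across all outputs, which matches the observation above that only one model is built.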
