
How is feature importance calculated for GradientBoostingClassifier

I'm using scikit-learn's gradient-boosted trees classifier, GradientBoostingClassifier. It exposes feature importance scores via feature_importances_. How are these feature importances calculated?

I'd like to understand what algorithm scikit-learn uses, to help me interpret those numbers. The algorithm isn't listed in the documentation.
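A minimal sketch of fitting the classifier and reading the scores in question, assuming a toy dataset built with make_classification (the dataset and hyperparameters here are illustrative, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy binary-classification dataset with 5 features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# One non-negative score per feature; the scores sum to 1.0.
print(clf.feature_importances_)
```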

This is documented elsewhere in the scikit-learn documentation. In particular, here is how it works:

For each tree, we calculate the importance of a feature F as the fraction of samples that traverse a node that splits on feature F (see here). Then we average those numbers across all trees (as described here).

It is not described exactly how scikit-learn estimates the fraction of samples that will traverse a node that splits on feature F.
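For what it's worth, in current scikit-learn versions the per-tree score is a mean decrease in impurity: each split on feature F contributes the impurity reduction it achieves, weighted by the fraction of samples reaching that node. The sketch below reproduces the ensemble's reported scores from the individual trees; note that compute_feature_importances is an internal Tree method, not part of the stable public API, and the dataset is an illustrative stand-in:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Collect the unnormalized importances of every tree in the ensemble
# (skipping degenerate single-node trees), then normalize the average.
trees = [t for stage in clf.estimators_ for t in stage if t.tree_.node_count > 1]
raw = [t.tree_.compute_feature_importances(normalize=False) for t in trees]
avg = np.mean(raw, axis=0)
manual = avg / avg.sum()

print(np.allclose(manual, clf.feature_importances_))
```

Averaging the unnormalized per-tree importances (rather than each tree's normalized feature_importances_) matters: trees with larger total impurity reduction carry more weight in the ensemble-level score.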

The interpretation: scores are in the range [0, 1], and higher scores mean the feature is more important. The result is an array of shape (n_features,) whose values are non-negative and sum to 1.0.
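Since the scores sum to 1.0, each one reads as a feature's share of the total importance. A quick way to turn the array into a readable ranking (again on an illustrative toy dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

imp = clf.feature_importances_
# Sort feature indices from most to least important.
ranking = np.argsort(imp)[::-1]
for idx in ranking:
    print(f"feature {idx}: {imp[idx]:.3f}")
```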

