
scikit learn - feature importance calculation in decision trees

I'm trying to understand how feature importance is calculated for decision trees in scikit-learn. This question has been asked before, but I am unable to reproduce the results the algorithm is providing.

For example:

from StringIO import StringIO

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree.export import export_graphviz
from sklearn.feature_selection import mutual_info_classif

X = [[1,0,0], [0,0,0], [0,0,1], [0,1,0]]

y = [1,0,1,1]

clf = DecisionTreeClassifier()
clf.fit(X, y)

feat_importance = clf.tree_.compute_feature_importances(normalize=False)
print("feat importance = " + str(feat_importance))

out = StringIO()
out = export_graphviz(clf, out_file='test/tree.dot')

results in feature importance:

feat importance = [0.25       0.08333333 0.04166667]

and gives the following decision tree:

[decision tree diagram from the original post]
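As a rough substitute for the image, the fitted tree can be rendered as text with sklearn.tree.export_text (available in newer scikit-learn versions); note that DecisionTreeClassifier() is created here without a random_state, so the exact splits may differ between runs.

from sklearn.tree import export_text

# Assumes clf is the classifier fitted above; feature names are only for readability
print(export_text(clf, feature_names=['X0', 'X1', 'X2']))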

Now, this answer to a similar question suggests the importance is calculated as

[formula_a: formula image from the linked answer, not reproduced here]

Where G is the node impurity, in this case the gini impurity. This is the impurity reduction as far as I understood it. However, for feature 1 this should be:

[formula_b: the asker's calculation for feature 1, shown as an image in the original post]

This answer suggests the importance is weighted by the probability of reaching the node (which is approximated by the proportion of samples reaching that node). Again, for feature 1 this should be:

[formula_c: the asker's weighted calculation for feature 1, shown as an image in the original post]

Both formulas provide the wrong result. How is the feature importance calculated correctly?

I think feature importance depends on the implementation, so we need to look at the scikit-learn documentation:

The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

That reduction, or weighted information gain, is defined as:

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

Since each feature is used only once in your case, the importance of each feature is simply the weighted impurity decrease of the single node that splits on it, i.e. the equation above.

For X[2]:

feature_importance = (4 / 4) * (0.375 - (0.75 * 0.444)) = 0.042

For X[1]:

feature_importance = (3 / 4) * (0.444 - (2/3 * 0.5)) = 0.083

For X[0]:

feature_importance = (2 / 4) * (0.5) = 0.25
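The hand calculations above use the gini values as truncated in the tree plot; a quick sketch with exact fractions shows they match the library output exactly:

from fractions import Fraction

# Exact gini impurities for the tree described above
g_root = Fraction(3, 8)   # root, y = [1, 0, 1, 1]:  1 - (3/4)^2 - (1/4)^2
g_mid  = Fraction(4, 9)   # 3-sample child:          1 - (2/3)^2 - (1/3)^2
g_low  = Fraction(1, 2)   # 2-sample child:          1 - (1/2)^2 - (1/2)^2

imp_x2 = Fraction(4, 4) * (g_root - Fraction(3, 4) * g_mid)  # root split on X[2]
imp_x1 = Fraction(3, 4) * (g_mid - Fraction(2, 3) * g_low)   # split on X[1]
imp_x0 = Fraction(2, 4) * g_low                              # split on X[0]

print(float(imp_x0), float(imp_x1), float(imp_x2))
# 0.25 0.08333333333333333 0.041666666666666664 -- the values from compute_feature_importances(normalize=False)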

A single feature can be used in different branches of the tree; its feature importance is then its total contribution to reducing the impurity:

feature_importance += number_of_samples_at_parent_where_feature_is_used * impurity_at_parent - left_child_samples * impurity_left - right_child_samples * impurity_right

impurity is the gini/entropy value

normalized_importance = feature_importance / number_of_samples_at_root_node   # i.e. the total number of samples

In the example above:

feature_2_importance = 0.375*4 - 0.444*3 - 0*1 = 0.168
normalized = 0.168/4 (total number of samples) = 0.042

If feature_2 were used in other branches as well, you would calculate its importance at each such parent node and sum up the values.

There is a small difference between the feature importances calculated here and the ones returned by the library, because we are using the truncated gini values shown in the graph.

Instead, we can access all the required data through the classifier's tree_ attribute, which can be used to probe the feature used, threshold value, impurity, number of samples at each node, etc.

For example, clf.tree_.feature gives the list of features used at each node; a negative value indicates a leaf node.

Similarly, clf.tree_.children_left and clf.tree_.children_right give the indices of each node's left and right children, which index into the same arrays such as clf.tree_.feature.

Using the above, traverse the tree and use the same indices into clf.tree_.impurity and clf.tree_.weighted_n_node_samples to get the gini/entropy value and the number of samples at each node and at its children.

import numpy as np

def dt_feature_importance(model, normalize=True):

    left_c = model.tree_.children_left
    right_c = model.tree_.children_right

    impurity = model.tree_.impurity    
    node_samples = model.tree_.weighted_n_node_samples 

    # Initialize the feature importance, those not used remain zero
    feature_importance = np.zeros((model.tree_.n_features,))

    for idx,node in enumerate(model.tree_.feature):
        if node >= 0:
            # Accumulate the feature importance over all the nodes where it's used
            feature_importance[node]+=impurity[idx]*node_samples[idx]- \
                                   impurity[left_c[idx]]*node_samples[left_c[idx]]-\
                                   impurity[right_c[idx]]*node_samples[right_c[idx]]

    # Number of samples at the root node
    feature_importance/=node_samples[0]

    if normalize:
        normalizer = feature_importance.sum()
        if normalizer > 0:
            feature_importance/=normalizer

    return feature_importance

This function will return exactly the same values as clf.tree_.compute_feature_importances(normalize=...).
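For example, a quick check on the classifier fitted in the question (assuming clf and dt_feature_importance are defined as above):

print(dt_feature_importance(clf, normalize=False))
# should match clf.tree_.compute_feature_importances(normalize=False),
# i.e. roughly [0.25, 0.08333333, 0.04166667]

print(dt_feature_importance(clf, normalize=True))
# should match clf.feature_importances_, the sum-normalized importances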

To sort the features based on their importance:

features = clf.tree_.feature[clf.tree_.feature>=0] # Feature number should not be negative, indicates a leaf node
sorted(zip(features,dt_feature_importance(clf,False)[features]),key=lambda x:x[1],reverse=True)
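For the example tree above (features 2, 1 and 0 used, with unnormalized importances of roughly 0.0417, 0.0833 and 0.25), this should produce something like:

[(0, 0.25), (1, 0.08333333), (2, 0.04166667)]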
