sklearn min_impurity_decrease explanation

Question

The definition of min_impurity_decrease in sklearn is:

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

Using the Iris dataset and setting min_impurity_decrease = 0.0, we obtain this:

How the tree looks when min_impurity_decrease = 0.0

Setting min_impurity_decrease = 0.1, we obtain this:

How the tree looks when min_impurity_decrease = 0.1
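
For readers who want to reproduce the comparison, here is a minimal sketch. The 60/40 train/test split and `random_state=0` are assumptions, since the question does not state how the data was split; they are chosen only so that the root node holds 90 training samples. The resulting trees may differ in detail from the ones pictured.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Assumed split: 150 samples * 0.6 = 90 training samples at the root.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Same data and seed; only min_impurity_decrease differs.
tree_full = DecisionTreeClassifier(
    min_impurity_decrease=0.0, random_state=0
).fit(X_train, y_train)
tree_pruned = DecisionTreeClassifier(
    min_impurity_decrease=0.1, random_state=0
).fit(X_train, y_train)

# Text rendering of both trees, plus their node counts for a quick comparison.
print(export_text(tree_full))
print(export_text(tree_pruned))
print(tree_full.tree_.node_count, tree_pruned.tree_.node_count)
```

The tree fitted with the higher threshold should have no more nodes than the unrestricted one, because the parameter can only stop splits, not add them.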

Looking at the green square where the gini index (impurity) = 0.2041, why was it not split when we set min_impurity_decrease = 0.1, even though the left child has gini index (impurity) = 0.0 and the right child has gini index (impurity) = 0.375?

Does this mean pruning every child node whose parent, once pruned, would have a gini index below 0.1? Because if that is the case, why did we not prune the second-level node with gini = 0.487, which is bigger than 0.1?

Answer 1

Steve, this reply is late, but I am posting it here in case others run across this problem and would like to know more about the minimum impurity decrease.

The formula for the minimum impurity decrease can be found in the scikit-learn documentation for DecisionTreeClassifier. It is defined as:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

Therefore, in your example:

Therefore, in your example:因此，在您的示例中：

N_t = 26
N = 90
N_t_R = 4
N_t_L = 22
impurity = 0.2041
right impurity = 0.375
left impurity = 0

I calculated the impurity decrease as about 0.04, which does not meet the threshold of 0.1 that you specified. So in essence, this formula takes into account both the fraction of all samples that reach the parent node (N_t / N) and the impurity decrease from the parent to its sample-weighted children. If the resulting impurity decrease is less than the min_impurity_decrease parameter, the split is not performed.
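
The arithmetic can be checked with a few lines of plain Python. This is only a sketch of the formula quoted above, not scikit-learn's internal code; the function name `weighted_impurity_decrease` is made up for illustration.

```python
def weighted_impurity_decrease(n, n_t, n_t_l, n_t_r,
                               impurity, left_impurity, right_impurity):
    """Evaluate the documented formula for a single candidate split."""
    return n_t / n * (impurity
                      - n_t_r / n_t * right_impurity
                      - n_t_l / n_t * left_impurity)


# Values read off the green node in the question.
decrease = weighted_impurity_decrease(
    n=90, n_t=26, n_t_l=22, n_t_r=4,
    impurity=0.2041, left_impurity=0.0, right_impurity=0.375,
)

print(round(decrease, 4))  # 0.0423
print(decrease >= 0.1)     # False, so the node is not split
```

Without the N_t / N factor the decrease would be about 0.146, which is above 0.1. The node is left unsplit only because it holds just 26 of the 90 samples.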
