
sklearn min_impurity_decrease explanation

The definition of min_impurity_decrease in sklearn is:

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

Using the Iris dataset and setting min_impurity_decrease = 0.0, we obtain this tree:

[Figure: how the tree looks when min_impurity_decrease = 0.0]

Setting min_impurity_decrease = 0.1, we obtain this:

[Figure: how the tree looks when min_impurity_decrease = 0.1]

Looking at the green square where the Gini index (impurity) = 0.2041, why was it not split when we set min_impurity_decrease = 0.1, although the left child's Gini index (impurity) = 0.0 and the right child's Gini index (impurity) = 0.375?

Does this mean that all child nodes are pruned whenever, once they are pruned, their parent node's Gini index is less than 0.1? Because if that is the case, why was the second-level node with gini = 0.487, which is bigger than 0.1, not pruned?

Steve, this reply is late, but I am posting it here in case others run across this problem and would like to know more about min_impurity_decrease.

The formula for the weighted impurity decrease can be found in the scikit-learn documentation for the min_impurity_decrease parameter. It is defined as:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
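As a rough sketch, the same quantities can be read off a fitted tree and plugged into the formula for every split. The helper name split_decreases is hypothetical, and the example fits on the full Iris dataset with random_state=0, which is an assumption (the question does not say how the data was split). It relies only on the public tree_ attributes children_left, children_right, impurity and weighted_n_node_samples.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def split_decreases(clf):
    """Return {node_id: weighted impurity decrease} for every internal node.

    Hypothetical helper: applies the formula above to a fitted tree.
    """
    t = clf.tree_
    n_total = t.weighted_n_node_samples[0]  # N: weighted samples at the root
    decreases = {}
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # -1 marks a leaf, so there is no split to score
            continue
        n_t = t.weighted_n_node_samples[node]   # N_t
        n_l = t.weighted_n_node_samples[left]   # N_t_L
        n_r = t.weighted_n_node_samples[right]  # N_t_R
        decreases[node] = float(
            n_t / n_total
            * (
                t.impurity[node]
                - n_r / n_t * t.impurity[right]
                - n_l / n_t * t.impurity[left]
            )
        )
    return decreases


# Assumption: full Iris data and random_state=0, not the asker's exact setup.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(min_impurity_decrease=0.1, random_state=0).fit(X, y)

decreases = split_decreases(clf)
for node, dec in decreases.items():
    print(f"node {node}: impurity decrease = {dec:.4f}")
```

Every split that survives in the fitted tree should show a decrease of at least 0.1, since splits that fall below the threshold are turned into leaves.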

Therefore, in your example:因此,在您的示例中:

N_t = 26
N = 90
N_t_R = 4
N_t_L = 22
impurity = 0.2041
right_impurity = 0.375
left_impurity = 0

I calculated the impurity decrease as about 0.042, that is, 26/90 * (0.2041 - 4/26 * 0.375 - 22/26 * 0), which does not meet the threshold of 0.1 that you specified. So in essence, this formula takes into account both the fraction of all samples that reach the parent node (N_t / N) and the impurity decrease from the parent to the sample-weighted average of its child nodes. If the resulting impurity decrease is less than the min_impurity_decrease parameter, the split is not performed. Note that the threshold is compared against this weighted decrease, not against the Gini index of the node itself.
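The calculation can be checked with a few lines of plain Python. This is only a sketch: the function name weighted_impurity_decrease is made up for illustration, and the inputs are the values listed above.

```python
def weighted_impurity_decrease(n, n_t, n_t_l, n_t_r,
                               impurity, left_impurity, right_impurity):
    """Weighted impurity decrease of a split, following the formula above."""
    return n_t / n * (
        impurity
        - n_t_r / n_t * right_impurity
        - n_t_l / n_t * left_impurity
    )


# Values for the green node in the question.
decrease = weighted_impurity_decrease(
    n=90, n_t=26, n_t_l=22, n_t_r=4,
    impurity=0.2041, left_impurity=0.0, right_impurity=0.375,
)

print(round(decrease, 4))  # 0.0423
print(decrease >= 0.1)     # False, so the node is not split
```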
