Effects of Observations on Decision Tree Prediction using rpart (R package)
I'm very new to machine learning, so I apologize if the answer to this is very obvious.
I'm using a decision tree, built with the rpart package, to attempt to predict when a structure fire may result in a fatality, using a variety of variables related to that fire such as the cause, the extent of damage, etc.
The chance of a fatality resulting from a structure fire is about 1 in 100.
In short, I have about 154,000 observations in my training set. I have noticed that when I use the full training set, the complexity parameter cp has to be reduced all the way down to .0003:
> rpart(Fatality~.,data=train_val,method="class", control=rpart.control(minsplit=50,minbucket = 1, cp=0.00035))
n= 154181
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 154181 1881 0 (0.987800053 0.012199947)
2) losscat=Minor_Loss,Med_Loss 105538 567 0 (0.994627528 0.005372472) *
3) losscat=Major_Loss,Total_Loss 48643 1314 0 (0.972986863 0.027013137)
6) HUM_FAC_1=3,6,N, 46102 1070 0 (0.976790595 0.023209405) *
7) HUM_FAC_1=1,2,4,5,7 2541 244 0 (0.903974813 0.096025187)
14) AREA_ORIG=21,24,26,47,72,74,75,76,Other 1846 126 0 (0.931744312 0.068255688)
28) CAUSE_CODE=1,2,5,6,7,8,9,10,12,14,15 1105 45 0 (0.959276018 0.040723982) *
29) CAUSE_CODE=3,4,11,13,16 741 81 0 (0.890688259 0.109311741)
58) FIRST_IGN=10,12,15,17,18,Other,UU 690 68 0 (0.901449275 0.098550725) *
59) FIRST_IGN=00,21,76,81 51 13 0 (0.745098039 0.254901961)
118) INC_TYPE=111,121 48 10 0 (0.791666667 0.208333333) *
119) INC_TYPE=112,120 3 0 1 (0.000000000 1.000000000) *
15) AREA_ORIG=14,UU 695 118 0 (0.830215827 0.169784173)
30) CAUSE_CODE=1,2,4,7,8,10,11,12,13,14,15,16 607 86 0 (0.858319605 0.141680395) *
31) CAUSE_CODE=3,5,6,9 88 32 0 (0.636363636 0.363636364)
62) HUM_FAC_1=1,2 77 24 0 (0.688311688 0.311688312) *
63) HUM_FAC_1=4,5,7 11 3 1 (0.272727273 0.727272727) *
However, when I just grab the first 10,000 observations (in no meaningful order), I can now run with a cp of .01:
> rpart(Fatality~., data = test, method = "class",
+ control=rpart.control(minsplit=10,minbucket = 1, cp=0.01))
n= 10000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 10000 112 0 (0.988800000 0.011200000)
2) losscat=Minor_Loss,Med_Loss 6889 26 0 (0.996225867 0.003774133) *
3) losscat=Major_Loss,Total_Loss 3111 86 0 (0.972356156 0.027643844)
6) HUM_FAC_1=3,7,N 2860 66 0 (0.976923077 0.023076923) *
7) HUM_FAC_1=1,2,4,5,6 251 20 0 (0.920318725 0.079681275)
14) CAUSE_CODE=1,3,4,6,7,8,9,10,11,14,15 146 3 0 (0.979452055 0.020547945) *
15) CAUSE_CODE=5,13,16 105 17 0 (0.838095238 0.161904762)
30) weekday=Friday,Monday,Saturday,Tuesday,Wednesday 73 6 0 (0.917808219 0.082191781) *
31) weekday=Sunday,Thursday 32 11 0 (0.656250000 0.343750000)
62) AREA_ORIG=21,26,47,Other 17 2 0 (0.882352941 0.117647059) *
63) AREA_ORIG=14,24,UU 15 6 1 (0.400000000 0.600000000)
126) month=2,6,7,9 7 1 0 (0.857142857 0.142857143) *
127) month=1,4,10,12 8 0 1 (0.000000000 1.000000000) *
Is having to reduce the cp to .003 "bad"?

cp, from what I read, is a parameter used to decide when to stop adding more leaves to the tree: for a node to be considered for another split, the split must improve the relative error by more than the cp threshold. Thus, the lower the number, the more leaves can be added. More observations imply an opportunity to lower the threshold; I'm not sure I understand that you "have to" reduce cp... but I could be wrong. If this is a very rare event and your data doesn't lend itself to showing significant improvement in the early stages of the model, you may need to "increase the sensitivity" by lowering the cp... but you probably know your data better than I do. If this is not a rare event, the lower the cp, the more likely you are to overfit to the bias of your sample. I don't think minbucket=1 ever leads to an interpretable model, either... for similar reasons.
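To make the cp discussion concrete, here is a minimal sketch (assuming a data frame train_val with a binary Fatality column, as in the question) that grows a deliberately deep tree with a tiny cp and then prunes it back using the cross-validated error that rpart records in the cptable, instead of guessing a cp by hand:

```r
library(rpart)

# Grow a deliberately deep tree with a very small cp; for each candidate cp,
# rpart stores the cross-validated error (xerror) in fit$cptable.
fit <- rpart(Fatality ~ ., data = train_val, method = "class",
             control = rpart.control(minsplit = 50, cp = 0.0001, xval = 10))

printcp(fit)   # table of CP, nsplit, rel error, xerror, xstd
plotcp(fit)    # visual check of xerror vs. cp

# One-SE rule: take the largest cp whose xerror is within one standard
# error of the minimum xerror, then prune the tree to that cp.
cptab  <- fit$cptable
best   <- which.min(cptab[, "xerror"])
cutoff <- cptab[best, "xerror"] + cptab[best, "xstd"]
cp_1se <- cptab[which(cptab[, "xerror"] <= cutoff)[1], "CP"]
pruned <- prune(fit, cp = cp_1se)
```

Separately, with a roughly 1% positive class, a tree that predicts "no fatality" everywhere already achieves about 99% accuracy, so you may also want to pass rpart a prior (e.g. parms = list(prior = c(0.5, 0.5))) or a loss matrix that penalizes missed fatalities more heavily; otherwise the algorithm has little incentive to split toward the rare class at any cp.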