Effects of Observations on Decision Tree Prediction using rpart (R package)

I'm very new to machine learning, so I apologize if the answer to this is very obvious.

I'm using a decision tree, built with the rpart package, to try to predict when a structure fire may result in a fatality, using a variety of variables related to that fire such as the cause, the extent of damage, etc.

The chance of a fatality resulting from a structure fire is about 1 in 100.

In short, I have about 154,000 observations in my training set. I have noticed that when I use the full training set, the complexity parameter cp has to be reduced all the way down to .0003:

> rpart(Fatality~.,data=train_val,method="class", control=rpart.control(minsplit=50,minbucket = 1, cp=0.00035))
n= 154181 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 154181 1881 0 (0.987800053 0.012199947)  
    2) losscat=Minor_Loss,Med_Loss 105538  567 0 (0.994627528 0.005372472) *
    3) losscat=Major_Loss,Total_Loss 48643 1314 0 (0.972986863 0.027013137)  
      6) HUM_FAC_1=3,6,N, 46102 1070 0 (0.976790595 0.023209405) *
      7) HUM_FAC_1=1,2,4,5,7 2541  244 0 (0.903974813 0.096025187)  
       14) AREA_ORIG=21,24,26,47,72,74,75,76,Other 1846  126 0 (0.931744312 0.068255688)  
         28) CAUSE_CODE=1,2,5,6,7,8,9,10,12,14,15 1105   45 0 (0.959276018 0.040723982) *
         29) CAUSE_CODE=3,4,11,13,16 741   81 0 (0.890688259 0.109311741)  
           58) FIRST_IGN=10,12,15,17,18,Other,UU 690   68 0 (0.901449275 0.098550725) *
           59) FIRST_IGN=00,21,76,81 51   13 0 (0.745098039 0.254901961)  
            118) INC_TYPE=111,121 48   10 0 (0.791666667 0.208333333) *
            119) INC_TYPE=112,120 3    0 1 (0.000000000 1.000000000) *
       15) AREA_ORIG=14,UU 695  118 0 (0.830215827 0.169784173)  
         30) CAUSE_CODE=1,2,4,7,8,10,11,12,13,14,15,16 607   86 0 (0.858319605 0.141680395) *
         31) CAUSE_CODE=3,5,6,9 88   32 0 (0.636363636 0.363636364)  
           62) HUM_FAC_1=1,2 77   24 0 (0.688311688 0.311688312) *
           63) HUM_FAC_1=4,5,7 11    3 1 (0.272727273 0.727272727) *

However, when I just grab the first 10,000 observations (in no meaningful order), I can now run with a cp of .01:

> rpart(Fatality~., data = test, method = "class", 
+       control=rpart.control(minsplit=10,minbucket = 1, cp=0.01))
n= 10000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 10000 112 0 (0.988800000 0.011200000)  
    2) losscat=Minor_Loss,Med_Loss 6889  26 0 (0.996225867 0.003774133) *
    3) losscat=Major_Loss,Total_Loss 3111  86 0 (0.972356156 0.027643844)  
      6) HUM_FAC_1=3,7,N 2860  66 0 (0.976923077 0.023076923) *
      7) HUM_FAC_1=1,2,4,5,6 251  20 0 (0.920318725 0.079681275)  
       14) CAUSE_CODE=1,3,4,6,7,8,9,10,11,14,15 146   3 0 (0.979452055 0.020547945) *
       15) CAUSE_CODE=5,13,16 105  17 0 (0.838095238 0.161904762)  
         30) weekday=Friday,Monday,Saturday,Tuesday,Wednesday 73   6 0 (0.917808219 0.082191781) *
         31) weekday=Sunday,Thursday 32  11 0 (0.656250000 0.343750000)  
           62) AREA_ORIG=21,26,47,Other 17   2 0 (0.882352941 0.117647059) *
           63) AREA_ORIG=14,24,UU 15   6 1 (0.400000000 0.600000000)  
            126) month=2,6,7,9 7   1 0 (0.857142857 0.142857143) *
            127) month=1,4,10,12 8   0 1 (0.000000000 1.000000000) *
  1. Why is it that a greater number of observations results in me having to reduce the complexity parameter? Intuitively I would think it should be the opposite.
  2. Is having to reduce cp to .0003 "bad"?
  3. Generally, is there any other advice for improving the effectiveness of a decision tree, especially when predicting something that has such a low probability in the first place?
  1. cp, from what I read, is a parameter used to decide when to stop adding more leaves to the tree (for a node to be considered for another split, the improvement in relative error from allowing the new split must be more than that cp threshold). Thus, the lower the number, the more leaves the tree can add. More observations mean there is more opportunity to lower the threshold; I'm not sure I understand why you "have to" reduce cp, but I could be wrong. If this is a very rare event and your data doesn't lend itself to showing significant improvement in the early stages of the model, it may require that you "increase the sensitivity" by lowering cp... but you probably know your data better than I do.
  2. If you're modeling a rare event, no. If it's not a rare event, the lower your cp the more likely you are to overfit to the bias of your sample. I don't think minbucket=1 ever leads to an interpretable model either, for similar reasons (see the sketch after this list for an alternative that uses cross-validated pruning and a loss matrix instead of a very small cp).
  3. Decision trees, to me, don't make very much sense beyond 3-4 levels unless you really believe that these hard cuts truly create criteria that justify a final "bucket"/node or a prediction (e.g. if I wanted to bucket you into a financial product like a loan or insurance product that fits your risk profile, and my actuaries made hard cuts to split the prospects). After you've split your data 3-4 times, producing a minimum of 8-16 nodes at the bottom of your tree, you've essentially built a model that could be thought of as 3rd- or 4th-order interactions of independent categorical variables. If you put 20 statisticians (not econo-missed's) in a room and ask them how many times they've seen significant 3rd- or 4th-order interactions in a model, they'd probably scratch their heads. Have you tried any other methods? Or started with dimension reduction? More importantly, what inferences are you trying to make about the data?
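
To make the points above concrete, here is a minimal sketch of one way to handle this with rpart: grow a deliberately deep tree with a small cp, let the cross-validated error in the cp table decide where to prune, and use a loss matrix to up-weight the rare fatality class instead of relying on a tiny cp alone. It assumes train_val is your training data frame and Fatality is a two-level factor (0 = no fatality, 1 = fatality); the penalty of 10 for missing a fatality is purely illustrative, not a recommendation.

library(rpart)

# Grow a deliberately deep tree with a small cp, then let
# cross-validation (the xerror column of the cp table) decide
# where to prune, instead of hand-picking cp up front.
fit <- rpart(Fatality ~ ., data = train_val, method = "class",
             control = rpart.control(minsplit = 50, cp = 0.0001, xval = 10))

printcp(fit)   # cp table: CP, nsplit, rel error, xerror, xstd
plotcp(fit)    # cross-validated error vs. cp

# Prune at the cp with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# For a rare outcome, an asymmetric loss matrix is often more useful
# than a tiny cp: here misclassifying a true fatality (row 2) as a
# non-fatality (column 1) costs 10 times as much as the reverse.
fit_loss <- rpart(Fatality ~ ., data = train_val, method = "class",
                  parms = list(loss = matrix(c(0, 10, 1, 0), nrow = 2)),
                  control = rpart.control(minsplit = 50, cp = 0.001))

Whichever route you take, judge the result with a confusion matrix on held-out data rather than raw accuracy: with a 1-in-100 event rate, a tree that predicts "no fatality" everywhere is already about 99% accurate.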
