Difference in prediction results from a random forest model

I have built a random forest model, and I got two different sets of predictions when I generated them with two different lines of code. I wonder which one is right. Here is my example data frame and the code I used:

dat <- read.table(text = " cats birds    wolfs     snakes
      0        3        9         7
      1        3        8         4
      1        1        2         8
      0        1        2         3
      0        1        8         3
      1        6        1         2
      0        6        7         1
      1        6        1         5
      0        5        9         7
      1        3        8         7
      1        4        2         7
      0        1        2         3
      0        7        6         3
      1        6        1         1
      0        6        3         9
      1        6        1         1   ",header = TRUE)

I've built a random forest model:

library(randomForest)

model <- randomForest(snakes ~ cats + birds + wolfs, data = dat, ntree = 20)
RF_pred <- data.frame(predict(model))
dat <- cbind(dat, RF_pred) # this gave me a prediction column named "predict.model."

Out of curiosity, I tried another syntax with this line of code:

dat$RF_pred <- predict(model, newdata = dat, type = 'response') # this gave me a prediction column named "RF_pred"

To my surprise, I got different predictions:

 dat
   cats birds wolfs snakes predict.model.  RF_pred
1     0     3     9      7       3.513889 5.400675
2     1     3     8      4       5.570000 5.295417
3     1     1     2      8       3.928571 5.092917
4     0     1     2      3       4.925893 4.208452
5     0     1     8      3       4.583333 4.014008
6     1     6     1      2       3.766667 2.943750
7     0     6     7      1       5.486806 4.061508
8     1     6     1      5       3.098148 2.943750
9     0     5     9      7       4.575397 5.675675
10    1     3     8      7       4.729167 5.295417
11    1     4     2      7       4.416667 5.567917
12    0     1     2      3       4.222619 4.208452
13    0     7     6      3       6.125714 4.036508
14    1     6     1      1       3.695833 2.943750
15    0     6     3      9       4.115079 5.178175
16    1     6     1      1       3.595238 2.943750

Why is there a difference between the two? Which one is correct? Any ideas?

The difference is in the two calls to predict:

predict(model)

and

predict(model, newdata=dat)

The first call, with no newdata argument, returns the out-of-bag (OOB) predictions for your training data: each observation is predicted using only the trees whose bootstrap sample did not include it. This is generally what you want when comparing predicted values to actuals.
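As a minimal sketch of the point above, the following refits the model on the question's data and checks that predict(model) simply returns the OOB predictions stored on the fitted object. The seed is arbitrary, and ntree = 500 is my own choice rather than the question's 20: with only 20 trees on 16 rows there is a small chance that a row is never out-of-bag, which would give it an NA prediction.

```r
library(randomForest)  # assumes the randomForest package is installed

# The data frame from the question, written out column by column
dat <- data.frame(
  cats   = c(0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
  birds  = c(3, 3, 1, 1, 1, 6, 6, 6, 5, 3, 4, 1, 7, 6, 6, 6),
  wolfs  = c(9, 8, 2, 2, 8, 1, 7, 1, 9, 8, 2, 2, 6, 1, 3, 1),
  snakes = c(7, 4, 8, 3, 3, 2, 1, 5, 7, 7, 7, 3, 3, 1, 9, 1)
)

set.seed(42)  # arbitrary seed, only so the run is reproducible
# Assumption: 500 trees instead of the question's 20 (see the note above)
model <- randomForest(snakes ~ cats + birds + wolfs, data = dat, ntree = 500)

oob_pred <- predict(model)                 # no newdata: out-of-bag predictions
in_pred  <- predict(model, newdata = dat)  # all trees, including those trained on the row

# The OOB predictions are the ones stored in the fitted object
print(all.equal(oob_pred, model$predicted))

# Number of trees for which each row was out-of-bag
print(model$oob.times)

# Side-by-side view of both kinds of prediction
print(head(cbind(dat, oob_pred, in_pred)))
```

The oob.times component is useful as a sanity check: a row with a count of 0 was in every bootstrap sample and has no OOB prediction.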

The second treats your training data as if it were a new dataset and runs every observation down every tree, including the trees that were trained on that observation. This results in an artificially close correlation between the predictions and the actuals, because the RF algorithm generally doesn't prune the individual trees and instead relies on the ensemble of trees to control overfitting. So don't do this if you want to get predictions on the training data.
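To illustrate why the in-sample predictions look too good, here is a rough sketch on simulated data (not the question's data; the variables x1, x2, x3, y and the 200/100 split are made up for the example). It compares the mean squared error of the in-sample predictions, the OOB predictions, and predictions on rows the forest never saw.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(1)  # arbitrary seed, only so the run is reproducible

# Hypothetical data: one informative predictor, two noise predictors, noisy response
n <- 300
sim <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
sim$y <- 2 * sim$x1 + rnorm(n)

train <- sim[1:200, ]    # rows used to fit the forest
test  <- sim[201:300, ]  # rows the forest never sees

fit <- randomForest(y ~ x1 + x2 + x3, data = train, ntree = 500)

mse <- function(actual, predicted) mean((actual - predicted)^2)

mse_in   <- mse(train$y, predict(fit, newdata = train))  # training data passed as "new" data
mse_oob  <- mse(train$y, predict(fit))                   # out-of-bag predictions
mse_test <- mse(test$y,  predict(fit, newdata = test))   # genuinely unseen data

print(c(in_sample = mse_in, out_of_bag = mse_oob, held_out = mse_test))
```

In a setup like this, the in-sample error should come out clearly lower than the other two, while the OOB error should be much closer to the held-out error. That is the sense in which predict(model) gives the more honest picture of performance on the training data.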
