
RandomForest for Regression in R

I'm experimenting with R and the randomForest package; I have some experience with SVMs and neural nets. My first test is to try to regress sin(x) + Gaussian noise. With neural nets and SVMs I obtain a "relatively" nice approximation of sin(x), so the noise is filtered out and the learning algorithm doesn't overfit (for decent parameters). When doing the same with randomForest I get a completely overfitted solution. I simply use (R 2.14.0, tried on 2.14.1 too, just in case):

library("randomForest")
x <- seq(-3.14, 3.14, by=0.00628)   # 1001 points on roughly [-pi, pi]
noise <- rnorm(1001)
y <- sin(x) + noise/4               # target: sin(x) plus Gaussian noise
mat <- matrix(c(x, y), ncol=2, dimnames=list(NULL, c("X", "Y")))
plot(x, predict(randomForest(Y~., data=mat), mat), col="green")
points(x, y)

I guess there is a magic option in randomForest to make it work correctly; I tried a few, but I did not find the right lever to pull...

You can use maxnodes to limit the size of the trees, as in the examples in the manual.

r <- randomForest(Y~.,data=mat, maxnodes=10)
plot(x,predict(r,mat),col="green")
points(x,y)
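
For reference, you can score this maxnodes model against the noiseless target the same way the next answer does (a small addition of mine, not part of the original answer):

sd(predict(r, mat) - sin(x))   # training-set RMSE against the true sin(x)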

You can do a lot better (rmse ~ 0.04, $R^2$ > 0.99) by training individual trees on small samples, or "bites" as Breiman called them.

Since there is a significant amount of noise in the training data, this problem is really about smoothing rather than generalization. In general machine-learning terms, this requires increasing regularization. For an ensemble learner, this means trading strength for diversity.

The diversity of random forests can be increased by reducing the number of candidate features per split (mtry in R) or the training-set size of each tree (sampsize in R). Since there is only 1 input dimension, mtry does not help, leaving sampsize. This leads to a 3.5x improvement in RMSE over the default settings and a >6x improvement over the noisy training data itself. Since increased diversity means increased variance in the predictions of the individual learners, we also need to increase the number of trees to stabilize the ensemble prediction.

small bags, more trees :: rmse = 0.04

> sd(predict(randomForest(Y~., data=mat, sampsize=60, nodesize=2,
                          replace=FALSE, ntree=5000),
             mat)
     - sin(x))
[1] 0.03912643

default settings :: rmse = 0.14

> sd(predict(randomForest(Y~.,data=mat),mat) - sin(x))
[1] 0.1413018

error due to noise in training set :: rmse = 0.25

> sd(y - sin(x))
[1] 0.2548882

The error due to noise is of course evident from the data generation itself (noise is standard normal, so noise/4 has a standard deviation of 1/4 = 0.25):

noise<-rnorm(1001)
y<-sin(x)+noise/4

In the above, the evaluation is done against the training set, as in the original question. Since the issue is smoothing rather than generalization, this is not as egregious as it may seem, but it is reassuring to see that out-of-bag evaluation (which predict returns when no new data is given) shows similar accuracy:

> sd(predict(randomForest(Y~.,data=mat, sampsize=60, nodesize=2,
                          replace=FALSE, ntree=5000))
     - sin(x))
[1] 0.04059679

My intuition is that:

  • if you had a simple decision tree to fit a 1-dimensional curve f(x), that would be equivalent to fitting with a staircase function (not necessarily with equally spaced jumps)
  • with random forests you will make a linear combination of staircase functions

For a staircase function to be a good approximator of f(x), you want enough steps on the x axis, but each step should contain enough points so that their mean is a good approximation of f(x) and is less affected by noise.
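
To see the staircase directly, here is a minimal sketch (my addition, not from the original answer): with ntree=1, replace=FALSE, and a full-size sample, the bagging disappears and the forest degenerates to a single CART-style tree whose prediction is piecewise constant in x.

single <- randomForest(Y~., data=mat, ntree=1,
                       replace=FALSE, sampsize=nrow(mat))
plot(x, predict(single, mat), type="s", col="red")   # the steps are visible
points(x, y)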

So I suggest you tune the nodesize parameter. If you have 1 decision tree and N points, and nodesize = n, then your staircase function will have N/n steps. Too small an n leads to overfitting. I got nice results with n ~ 30 (RMSE ~ 0.07):

r <- randomForest(Y~.,data=mat, nodesize=30)
plot(x,predict(r,mat),col="green")
points(x,y)
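
To make the nodesize trade-off explicit, a quick sweep (my addition; the grid of values is arbitrary) prints the training-set RMSE for several node sizes:

for (n in c(5, 10, 30, 60, 100)) {
  r <- randomForest(Y~., data=mat, nodesize=n)
  cat("nodesize =", n, " rmse =", sd(predict(r, mat) - sin(x)), "\n")
}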

Notice that RMSE gets smaller if you take N' = 10*N and n' = 10*n.
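
One way to check that claim is the following sketch (my addition; x10, y10, mat10 are hypothetical names, and the run is slower since it uses 10x the data):

x10 <- seq(-3.14, 3.14, by=0.000628)   # N' = 10*N, i.e. 10001 points
y10 <- sin(x10) + rnorm(length(x10))/4
mat10 <- matrix(c(x10, y10), ncol=2, dimnames=list(NULL, c("X", "Y")))
r10 <- randomForest(Y~., data=mat10, nodesize=300)   # n' = 10*n
sd(predict(r10, mat10) - sin(x10))     # expect an RMSE below the ~0.07 above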


 