
Using the caret package to find optimal parameters of GBM

I'm using the R gbm package for boosted regression on biological data of dimensions 10,000 x 932, and I want to know the best parameter settings for it, especially n.trees, shrinkage, interaction.depth, and n.minobsinnode. Searching online, I found that the caret package in R can find such parameter settings. However, I'm having difficulty using caret together with gbm, so I'd like to know how to use caret to find the optimal combination of the parameters mentioned above. I know this may seem like a very typical question, but I've read the caret manual and still have trouble integrating caret with gbm, especially since I'm very new to both packages.

Not sure if you found what you were looking for, but I find some of these sheets less than helpful.

If you are using the caret package, the following call describes the required parameters: getModelInfo()$gbm$parameters
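For reference, that call returns a small data frame listing the four tunable parameters for method = "gbm". A minimal sketch of the call and its output is below; the exact labels may differ slightly between caret versions, so treat them as indicative:

library(caret)
# Parameters caret tunes for method = "gbm"
getModelInfo()$gbm$parameters
#           parameter   class                   label
# 1           n.trees numeric   # Boosting Iterations
# 2 interaction.depth numeric          Max Tree Depth
# 3         shrinkage numeric               Shrinkage
# 4    n.minobsinnode numeric Min. Terminal Node Size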

Here are some rules of thumb for running GBM:

  1. interaction.depth: the default is 1, and on most data sets that seems adequate, but on a few I have found that testing odd values up to the maximum gives better results. The maximum value I have seen for this parameter is floor(sqrt(NCOL(training))).
  2. shrinkage: the smaller the number, the better the predictive value, but the more trees are required and the higher the computational cost. Testing values on a small subset of the data with something like shrinkage = seq(.0005, .05, .0005) can be helpful in finding the ideal value (see the sketch after this list).
  3. n.minobsinnode: the default is 10, and generally I don't change it. I have tried c(5, 10, 15, 20) on small data sets and didn't really see an adequate return for the computational cost.
  4. n.trees: the smaller the shrinkage, the more trees you should have. Start with n.trees = (0:50)*50 and adjust accordingly.
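As a rough illustration of rule 2, the sketch below scans shrinkage alone on a random subset of the rows while the other parameters stay fixed. The subset fraction, the fixed values, and the 5-fold CV are assumptions for illustration, not part of the original answer; training and Outcome are the same objects used in the example further down.

library(caret)
set.seed(1)
# Scan shrinkage on ~20% of the rows to keep the run cheap
idx    <- sample(nrow(training), floor(0.2 * nrow(training)))
subTrn <- training[idx, ]
shrinkGrid <- expand.grid(shrinkage         = seq(.0005, .05, .0005),
                          n.trees           = 500,  # held fixed for the scan
                          interaction.depth = 3,    # held fixed for the scan
                          n.minobsinnode    = 10)
shrinkFit <- train(Outcome ~ ., data = subTrn,
                   method = "gbm", verbose = FALSE,
                   trControl = trainControl(method = "cv", number = 5),
                   tuneGrid = shrinkGrid)
shrinkFit$bestTune$shrinkage  # candidate value for the full grid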

Example setup using the caret package:

library(caret)           # train(), trainControl(), getModelInfo()
library(parallel)
library(doMC)
registerDoMC(cores = 20) # parallel backend; set cores to suit your machine

getModelInfo()$gbm$parameters

# Max shrinkage for gbm
nl <- nrow(training)
max(0.01, 0.1 * min(1, nl/10000))
# Max value for interaction.depth
floor(sqrt(NCOL(training)))

gbmGrid <- expand.grid(interaction.depth = c(1, 3, 6, 9, 10),
                       n.trees = (0:50) * 50,
                       shrinkage = seq(.0005, .05, .0005),
                       n.minobsinnode = 10) # could also try c(5, 10, 15, 20)

fitControl <- trainControl(method = "repeatedcv",
                           repeats = 5,
                           preProcOptions = list(thresh = 0.95),
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)

# Method + Date + distribution
set.seed(1)
system.time(GBM0604ada <- train(Outcome ~ ., data = training,
                                distribution = "adaboost",
                                method = "gbm", bag.fraction = 0.5,
                                nTrain = round(nrow(training) * .75),
                                trControl = fitControl,
                                verbose = TRUE,
                                tuneGrid = gbmGrid,
                                ## Specify which metric to optimize
                                metric = "ROC"))

Things can change depending on your data (like the distribution), but I have found the key is to play with gbmGrid until you get the outcome you are looking for. The settings as they stand would take a long time to run, so modify them as your machine and time allow. To give you a ballpark for computation, I run on a 12-core Mac Pro with 64 GB of RAM.

This link has a concrete example (page 10): http://www.jstatsoft.org/v28/i05/paper

Basically, one should first create a grid of candidate values for the hyperparameters (like n.trees, interaction.depth and shrinkage), then call the generic train function as usual.
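Since the original question is about regression rather than classification, a minimal sketch of that workflow might look like the following. The data frame mydata, the response column y, and the small grid values are placeholders for illustration, not from the paper or the answers above:

library(caret)
set.seed(1)
# Hypothetical: mydata is a data frame with a numeric response column y
smallGrid <- expand.grid(n.trees           = c(100, 500, 1000),
                         interaction.depth = c(1, 3, 5),
                         shrinkage         = c(0.001, 0.01, 0.1),
                         n.minobsinnode    = 10)
gbmFit <- train(y ~ ., data = mydata,
                method = "gbm",
                distribution = "gaussian",  # squared-error loss for regression
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = smallGrid,
                metric = "RMSE",            # default metric for regression
                verbose = FALSE)
gbmFit$bestTune  # winning parameter combination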
