XGBoost (R) CV test vs. training error

I'll preface my question by saying that I am, currently, unable to share my data due to extremely strict confidentiality agreements surrounding it. Hopefully I'll be able to get permission to share the blinded data shortly.

I am struggling to get XGBoost trained properly in R. I have been following the guide here and am so far stuck on step 1, tuning the nrounds parameter. The results I'm getting from my cross-validation aren't doing what I'd expect them to do, leaving me at a loss for where to proceed.

My data contains 105 observations, a continuous response variable (histogram in the top left pane of the image below) and 16095 predictor variables. All of the predictors are on the same scale and a histogram of them all is in the top right pane of the image below. The predictor variables are quite zero-heavy, with 62.82% of all values being 0.
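As a quick sanity check of that zero fraction, a minimal sketch, assuming the response variable sits in the first column of data.train as in the xgb.DMatrix call further down:

X <- as.matrix(data.train[, -1])   # drop the response column, keep the 16095 predictors
mean(X == 0)                       # fraction of predictor values that are exactly 0 (~0.6282 here)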

As a separate set of test data I have a further 48 observations. Both data sets have a very similar range in their response variables.

[Image: histogram of the response variable (top left), histogram of the predictor values (top right), and the XGBoost CV training/test RMSE curves (bottom)]

So far I've been able to fit a PLS model and a Random Forest (using the R library ranger). Applying these two models to my test data set I've been able to predict and get an RMSE of 19.133 from PLS and 15.312 from ranger. In the case of ranger, successive model fits are proving very stable using 2000 trees and 760 candidate variables per split.
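For reference, a minimal sketch of a ranger fit along those lines, assuming the same data.train / data.test data frames loaded below with the response in the first column; the num.trees and mtry values mirror the 2000 trees and 760 variables per split mentioned above:

library(ranger)

rf <- ranger(dependent.variable.name = names(data.train)[1],
             data      = data.train,
             num.trees = 2000,   # number of trees reported as giving stable fits
             mtry      = 760)    # candidate variables tried at each split

pred <- predict(rf, data = data.test)$predictions
sqrt(mean((data.test[, 1] - pred)^2))   # test-set RMSE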

Returning to XGBoost, using the code below, I have been fixing all parameters except nrounds and using the xgb.cv function in the R package xgboost to calculate the training and test errors.

data.train<-read.csv("../Data/Data_Train.csv")
data.test<-read.csv("../Data/Data_Test.csv")

dtrain <- xgb.DMatrix(data = as.matrix(data.train[,-c(1)]), 
label=data.train[,1])
# dtest <- xgb.DMatrix(data = as.matrix(data.test[,-c(1)]), label=data.test[,1]) # Not used here

## Step 1 - tune number of trees using CV function

  eta = 0.1; gamma = 0; max_depth = 15;
  min_child_weight = 1; subsample = 0.8; colsample_bytree = 0.8
  nround=2000
  cv <- xgb.cv(
    params = list(
      ## General Parameters
      booster = "gbtree", # Default
      silent = 0, # Default

      ## Tree Booster Parameters
      eta = eta,
      gamma = gamma,
      max_depth = max_depth,
      min_child_weight = min_child_weight,
      subsample = subsample,
      colsample_bytree = colsample_bytree,
      num_parallel_tree = 1, # Default

      ## Linear Booster Parameters
      lambda = 1, # Default
      lambda_bias = 0, # Default
      alpha = 0, # Default

      ## Task Parameters
      objective = "reg:linear", # Default
      base_score = 0.5, # Default
      # eval_metric = , # Evaluation metric, set based on objective
      nthread = 60
    ),
    data = dtrain,
    nrounds = nround,
    nfold = 5,
    stratified = TRUE,
    prediction = TRUE,
    showsd = TRUE,
    # early_stopping_rounds = 20,
    # maximize = FALSE,
    verbose = 1
  )

library(ggplot2)   # the package is ggplot2, not ggplot
library(reshape2)

plot.df <- data.frame(NRound = as.matrix(cv$evaluation_log)[, 1],
                      Train  = as.matrix(cv$evaluation_log)[, 2],
                      Test   = as.matrix(cv$evaluation_log)[, 4])
plot.df <- melt(plot.df, measure.vars = 2:3)
ggplot(data = plot.df, aes(x = NRound, y = value, colour = variable)) +
  geom_line() + ylab("Mean RMSE")

If this function does what I believe it does, I was hoping to see the training error decrease to a plateau and the test error decrease and then begin to increase again as the model overfits. However, the output I'm getting looks like the output below (and also the lower panel of the image above).

##### xgb.cv 5-folds
    iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
       1      94.4494006   1.158343e+00       94.55660      4.811360
       2      85.5397674   1.066793e+00       85.87072      4.993996
       3      77.6640230   1.123486e+00       78.21395      4.966525
       4      70.3846390   1.118935e+00       71.18708      4.759893
       5      63.7045868   9.555162e-01       64.75839      4.668103
---                                                                 
    1996       0.0002458   8.158431e-06       18.63128      2.014352
    1997       0.0002458   8.158431e-06       18.63128      2.014352
    1998       0.0002458   8.158431e-06       18.63128      2.014352
    1999       0.0002458   8.158431e-06       18.63128      2.014352
    2000       0.0002458   8.158431e-06       18.63128      2.014352
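To turn that log into a concrete choice of nrounds rather than reading it off the plot, one option (a sketch, using the evaluation_log returned by xgb.cv and the early_stopping_rounds argument that is commented out in the call above) is:

# iteration with the lowest mean test RMSE across the 5 folds
best_iter <- which.min(cv$evaluation_log$test_rmse_mean)
cv$evaluation_log[best_iter, ]

# Alternatively, uncomment early_stopping_rounds = 20 and maximize = FALSE in the
# xgb.cv call above so that training halts once the test RMSE stops improving;
# the chosen round is then available as cv$best_iteration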

Considering how well ranger works, I'm inclined to believe that I'm doing something foolish and causing XGBoost to struggle!

Thanks

To tune your parameters you can use tuneParams from the mlr package. Here is an example:

 library(mlr)

 # Build the task; your_data is a placeholder data.frame, "your_target" is the name
 # of the response column, and the id is an arbitrary label. (For the regression
 # problem in the question, the analogues would be makeRegrTask, the "regr.xgboost"
 # learner, and a measure such as rmse.)
 task = makeClassifTask(id = "xgb_tuning", data = your_data, target = "your_target")

 # Define the search space
 tuning_options <- makeParamSet(
   makeNumericParam("eta",              lower = 0.1, upper = 0.4),
   makeNumericParam("colsample_bytree", lower = 0.5, upper = 1),
   makeNumericParam("subsample",        lower = 0.5, upper = 1),
   makeNumericParam("min_child_weight", lower = 3,   upper = 10),
   makeNumericParam("gamma",            lower = 0,   upper = 10),
   makeNumericParam("lambda",           lower = 0,   upper = 5),
   makeNumericParam("alpha",            lower = 0,   upper = 5),
   makeIntegerParam("max_depth",        lower = 1,   upper = 10),
   makeIntegerParam("nrounds",          lower = 50,  upper = 300))

 ctrl    = makeTuneControlRandom(maxit = 50L)   # 50 random draws from the search space
 rdesc   = makeResampleDesc("CV", iters = 3L)   # 3-fold cross-validation
 learner = makeLearner("classif.xgboost", predict.type = "response")

 res = tuneParams(learner = learner, task = task, resampling = rdesc,
                  par.set = tuning_options, control = ctrl, measures = acc)

Of course you can play around with the intervals for your parameters. In the end res will contain the optimal set of parameters for your xgboost, and then you can train your xgboost using those parameters. Keep in mind that you can choose resampling methods other than cross-validation; try ?makeResampleDesc.
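As a sketch of that final training step (assuming the task and learner objects from the code above), mlr's setHyperPars and train can apply the tuned parameters:

 tuned_learner <- setHyperPars(learner, par.vals = res$x)  # res$x holds the best parameter set found
 final_model   <- train(tuned_learner, task)

 # predict on new data; new_data is a placeholder data.frame with the same columns
 # preds <- predict(final_model, newdata = new_data)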

I hope it helps.
