
Confidence interval for xgboost regression in R

I am currently working on a dataset that contains 4 categorical input variables and one numeric output.

I created a model using the xgboost package in R, but I cannot find a way to compute a confidence interval (CI) for its predictions.

How can I compute the confidence interval for my predictions? I found this answer to a classification problem, but I do not understand it properly. Could someone explain it in more depth for my problem?

From what I can tell, there isn't a direct way to compute this using the xgboost package.

The linked article gives a framework for how you could go about doing it. It relies on "bagging", which basically means fitting the same model many times, where the model has some randomness in it. For xgboost, if you set colsample_bytree (the random fraction of columns used in each tree) to < 1 and subsample (the random fraction of rows used in each tree) to < 1, this introduces a random element into the model.

If you set the above parameters to less than 1, you have a model with a random element. If you then run this model 100 times, each time with a different seed value, you end up with 100 technically unique xgboost models, and therefore 100 different predictions for each observation. Using these 100 predictions, you can build a custom confidence interval from their mean and standard deviation.

I can't vouch for how effective or reliable these custom confidence intervals would be, but if you wanted to follow the example in the linked article, this is how you would do it, and this is what they were talking about.

Here is some sample code for doing this, assuming you have 500 observations:

##make an empty data frame with a column per bagging run
predictions <- data.frame(matrix(0,500,100))

library(xgboost)

##come up with 100 seed values that you can reproduce
set.seed(123)
seeds <- runif(100,1,100000)

for (i in 1:ncol(predictions)) {

  set.seed(seeds[i])
  xgb_model <- xgboost(data = train,
                       label = y,
                       nrounds = 100,  ##required by xgboost(); tune as needed
                       objective = "reg:linear",
                       eval_metric = "rmse",
                       subsample = .8,
                       colsample_bytree = .8
                       )

  ##store this run's predictions in column i
  predictions[,i] <- predict(xgb_model, newdata = test)

}
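Once the loop has filled `predictions`, the interval itself can be computed row by row. Here is a minimal base-R sketch; the `bagged_ci` helper name and the normal-approximation 95% interval are my own choices for illustration, not something from the linked article:

```r
## Collapse a matrix of bagged predictions (rows = observations,
## columns = bagging runs) into a per-observation interval,
## using the mean and standard deviation across runs
bagged_ci <- function(predictions, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   ## ~1.96 for a 95% interval
  m <- apply(predictions, 1, mean)
  s <- apply(predictions, 1, sd)
  data.frame(fit = m, lower = m - z * s, upper = m + z * s)
}

## Empirical alternative that avoids the normality assumption:
## t(apply(predictions, 1, quantile, probs = c(0.025, 0.975)))
```

So `ci <- bagged_ci(predictions)` would give one fitted value plus lower and upper bounds per observation; the commented-out quantile version may be safer if the 100 predictions are clearly skewed.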

A great option for getting quantiles from an xgboost regression is described in this blog post. I believe it is a more elegant solution than the other method suggested in the linked question (for regression):

https://www.bigdatarepublic.nl/regression-prediction-intervals-with-xgboost/

Basically, your problem can be described as follows (from the blog):

In the case that the quantile value q is relatively far apart from the observed values within the partition, then because the gradient and Hessian are both constant for a large difference x_i - q, the score stays zero and no split occurs.

The following solution is then suggested:

An interesting solution is to force a split by adding randomization to the gradient. When the differences between the observations x_i and the old quantile estimates q within a partition are large, this randomization will force a random split of this volume.
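To make that idea concrete, here is a hedged sketch of what a randomized quantile (pinball) loss objective could look like with the R xgboost package. The `quantile_grad`/`quantile_objective` names, the jitter range, and the `delta` threshold are illustrative assumptions of mine, not the blog's exact code:

```r
## Gradient of the pinball (quantile) loss at level alpha,
## taken with respect to the prediction
quantile_grad <- function(labels, preds, alpha) {
  err <- labels - preds
  ifelse(err > 0, -alpha, 1 - alpha)
}

## Custom objective usable as xgb.train(..., obj = quantile_objective(0.9)).
## Where |labels - preds| is large the gradient is constant, so we jitter
## it randomly to force a split, following the blog's suggestion.
quantile_objective <- function(alpha, delta = 1.0) {
  function(preds, dtrain) {
    labels <- xgboost::getinfo(dtrain, "label")
    grad <- quantile_grad(labels, preds, alpha)
    big <- abs(labels - preds) > delta
    grad[big] <- grad[big] * runif(sum(big), 0.5, 1.5)
    list(grad = grad, hess = rep(1, length(labels)))
  }
}
```

Training two such models, say with alpha = 0.05 and alpha = 0.95, would then give the lower and upper bounds of a 90% prediction interval.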
