简体   繁体   中英

How can I train a glmnet model (Poisson family) with an offset term using the caret package in R?

I want to model insurance claim count using a Poisson glmnet. The data I have at hand contains the number of claims for each policy (which is the response variable), some features about the policy (gender, region, etc.) as well as the duration of the policy (in years). I want to include the log-duration as an offset term, as we usually do in actuarial science. With the cv.glmnet function of the glmnet package, it is straightforward:

library(tidyverse)
library(glmnet)

n <- 100

dat <- tibble(
 nb_claims = rpois(n, lambda = 0.5),
 duration = runif(n),
 x1 = runif(n),
 x2 = runif(n),
 x3 = runif(n)
)


fit <- cv.glmnet(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  family = "poisson",
  offset = dat %>% pull(duration) %>% log()
)

fit

However, my goal is to train this model using the train function of the caret package, because of the many advantages it gives. Indeed, validation, preprocessing as well as feature selection is much better with this package. It is straightforward to train a basic glmnet (without an offset term) with caret :

library(caret)

fit <- caret::train(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  method = "glmnet",
  family = "poisson"
)

fit

Naively, we could try to add the offset argument in the train function:

fit <- caret::train(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  method = "glmnet",
  family = "poisson",
  offset = dat %>% pull(duration) %>% log()
)

fit

Unfortunately, this code throws the error Error: No newoffset provided for prediction, yet offset used in fit of glmnet . This error occurs because the caret::train function doesn't take care to give a value for the newoffset argument in predict.glmnet function.

In this book , they show how to add an offset term to a GLM model by modifying the source code of the caret::train function. It works perfectly. However, the predict.glm function is quite different from the predict.glmnet function, because it does not have the newoffset argument. I tried to modify the source code of the caret::train function, but I am having some trouble because I do not know well enough how this function works.

A simple way to perform this is pass the offset column as part of x and in each fit and predict call pass as x columns of x which are not the offset . While as offset / newoffset pass the x column corresponding to the offset .

In the following example the offest column of x needs to be named "offset" too. This can be changed relatively easy

To create the function we will just use lots of parts from: https://github.com/topepo/caret/blob/master/models/files/glmnet.R

glmnet is peculiar since it needs a loop , the rest is just rinse and reapeat from https://topepo.github.io/caret/using-your-own-model-in-train.html#illustrative-example-1-svms-with-laplacian-kernels

family = "poisson" will be specified throughout, to change this adopt code from https://github.com/topepo/caret/blob/master/models/files/glmnet.R

glmnet_offset <- list(type = "Regression",
                      library = c("glmnet", "Matrix"),
                      loop = function(grid) {
                        alph <- unique(grid$alpha)
                        loop <- data.frame(alpha = alph)
                        loop$lambda <- NA
                        submodels <- vector(mode = "list", length = length(alph))
                        for(i in seq(along = alph)) {
                          np <- grid[grid$alpha == alph[i],"lambda"]
                          loop$lambda[loop$alpha == alph[i]] <- np[which.max(np)]
                          submodels[[i]] <- data.frame(lambda = np[-which.max(np)])
                        }
                        list(loop = loop, submodels = submodels)
                      }) 


glmnet_offset$parameters <- data.frame(parameter = c('alpha', 'lambda'),
                                       class = c("numeric", "numeric"),
                                       label = c('Mixing Percentage', 'Regularization Parameter'))


glmnet_offset$grid <- function(x, y, len = NULL, search = "grid") {
  if(search == "grid") {
    init <- glmnet::glmnet(Matrix::as.matrix(x[,colnames(x) != "offset"]), y,
                           family = "poisson",
                           nlambda = len+2,
                           alpha = .5,
                           offset = x[,colnames(x) == "offset"])
    lambda <- unique(init$lambda)
    lambda <- lambda[-c(1, length(lambda))]
    lambda <- lambda[1:min(length(lambda), len)]
    out <- expand.grid(alpha = seq(0.1, 1, length = len),
                       lambda = lambda)
  } else {
    out <- data.frame(alpha = runif(len, min = 0, 1),
                      lambda = 2^runif(len, min = -10, 3))
  }
  out
}

So x[,colnames(x) != "offset"] is x while offset is x[,colnames(x) == "offset"]

glmnet_offset$fit <- function(x, y, wts, param, last, ...) {

  theDots <- list(...)


  ## pass in any model weights
  if(!is.null(wts)) theDots$weights <- wts

  if(!(class(x)[1] %in% c("matrix", "sparseMatrix")))
    x <- Matrix::as.matrix(x)
  modelArgs <- c(list(x = x[,colnames(x) != "offset"],
                      y = y,
                      alpha = param$alpha,
                      family = "poisson",
                      offset = x[,colnames(x) == "offset"]),
                 theDots)

  out <- do.call(glmnet::glmnet, modelArgs)
  if(!is.na(param$lambda[1])) out$lambdaOpt <- param$lambda[1]
  out
}

glmnet_offset$predict <- function(modelFit, newdata, submodels = NULL) {
  if(!is.matrix(newdata)) newdata <- Matrix::as.matrix(newdata)
    out <- predict(modelFit,
                   newdata[,colnames(newdata) != "offset"],
                   s = modelFit$lambdaOpt,
                   newoffset = newdata[,colnames(newdata) == "offset"],
                   type = "response") #important for measures to be appropriate

  if(is.matrix(out)) out <- out[,1]
  out
  if(!is.null(submodels)) {
      tmp <- as.list(as.data.frame(predict(modelFit,
                                          newdata[,colnames(newdata) != "offset"],
                                          s = submodels$lambda,
                                          newoffset = newdata[,colnames(newdata) == "offset"],
                                          type = "response"),
                                   stringsAsFactors = TRUE))
    out <- c(list(out), tmp)

  }
  out

}

For some reason which I don't understand yet it does not work without the prob slot

glmnet_offset$prob <- glmnet_offset$predict


glmnet_offset$tags = c("Generalized Linear Model", "Implicit Feature Selection",
                       "L1 Regularization", "L2 Regularization", "Linear Classifier",
                       "Linear Regression")


glmnet_offset$sort = function(x) x[order(-x$lambda, x$alpha),]
glmnet_offset$trim = function(x) {
  x$call <- NULL
  x$df <- NULL
  x$dev.ratio <- NULL
  x
}

library(tidyverse)
library(caret)
library(glmnet)

n <- 100
set.seed(123)
dat <- tibble(
  nb_claims = rpois(n, lambda = 0.5),
  duration = runif(n),
  x1 = runif(n),
  x2 = runif(n),
  x3 = runif(n)
)

x = dat %>%
  dplyr::select(-nb_claims) %>%
  mutate(offset = log(duration)) %>%
  dplyr::select(-duration) %>%
  as.matrix

fit <- caret::train(
  x = x,
  y = dat %>% pull(nb_claims),
  method = glmnet_offset,
)

fit
100 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 
Resampling results across tuning parameters:

  alpha  lambda        RMSE       Rsquared    MAE      
  0.10   0.0001640335  0.7152018  0.01805762  0.5814200
  0.10   0.0016403346  0.7152013  0.01805684  0.5814193
  0.10   0.0164033456  0.7130390  0.01798125  0.5803747
  0.55   0.0001640335  0.7151988  0.01804917  0.5814020
  0.55   0.0016403346  0.7150312  0.01802689  0.5812936
  0.55   0.0164033456  0.7095996  0.01764947  0.5783706
  1.00   0.0001640335  0.7152033  0.01804795  0.5813997
  1.00   0.0016403346  0.7146528  0.01798979  0.5810811
  1.00   0.0164033456  0.7063482  0.01732168  0.5763653

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.01640335.

predict(fit$finalModel,  x[,1:3], newoffset = x[,4]) #works

This will not work with preprocessing in caret since we pass offset as one of the features. However it will work with recipes since you can define columns on which preprocessing functions will be performed via selections . Se article for details: https://tidymodels.github.io/recipes/articles/Selecting_Variables.html

I haven't had time to error check my code. If any problems occur or if there is a mistake somewhere please comment. Thanks.

You can also post an issue in caret github asking this feature (offset/newoffset) to be added to the model

I tried to change the model info a lot of ways, but it was failing miserably. Below I can propose one solution, may not be the best, but will get you somewhere if your data is sensible.

In the poisson / negative binom.. regression, the offset in factor gets introduced into the regression, you can read more here and here :

在此处输入图像描述

where tx is the offset. In glmnet, there is a penalty factor you can introduce for each term, and if you let that be 0 for a term, basically you are not penalizing it and it's always included. We can use that for the offset, and you can see this effect only if you use a dataset that makes some sense (note that in your example dataset, the offsets are numbers that make no sense).

Below I use the insurance claims dataset from MASS:

library(tidyverse)
library(glmnet)
library(MASS)

dat <- Insurance
X =  model.matrix(Claims ~ District + Group + Age,data=dat)
Y = dat$Claims
OFF = log(dat$Holders)

fit_cv <- cv.glmnet(
  x = X,
  y = Y,
  family = "poisson",
  offset = OFF
)

Now using caret, I will fit it without any training, and using the same lambda obtained from the fit in cv.glmnet. One thing you should note too is that cv.glmnet often uses lambda.1se instead of lambda.min:

fit_c <- caret::train(
  x = cbind(X,OFF),
  y = Y,
  method = "glmnet",
  family = "poisson",
  tuneGrid=data.frame(lambda=fit_cv$lambda.1se,alpha=1),
  penalty=c(rep(1,ncol(X)),0),
  trControl = trainControl(method="none")
)

We can see how different are the predictions:

p1 = predict(fit_cv,newx=X,newoffset=OFF)
p2 = predict(fit_c,newx=cbind(X,OFF))

plot(p1,p2)

在此处输入图像描述

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM