Using caret together with GAM (“gamSpline” method) for Poisson regression in R

I am trying to use the caret package to tune the 'df' parameter of a GAM for my cohort analysis.

With the following data:

cohort = 1:60
age = 1:26
grid = data.frame(expand.grid(age = age, cohort = cohort))
size = data.frame(cohort = cohort, N = sample(100:150,length(cohort), replace = TRUE))
df = merge(grid, size, by = "cohort")

log_k = -3 + log(df$N) - 0.5*log(df$age) + df$cohort*(df$cohort-30)*(df$cohort-50)/20000 + runif(nrow(df),min = 0, max = 0.5)
df$conversion = rpois(nrow(df),exp(log_k))

Explanation of the data: cohort is the time of arrival of a group of potential customers. N is the number of potential customers that arrived at that time. conversion is the number of those potential customers who 'converted' (bought something). age is the age of the cohort (time elapsed since arrival) when the conversions took place. For a given cohort there are fewer conversions as age grows, and this effect follows a power law. The overall conversion rate of each cohort can also drift slowly over time (cohort number), so I want a smoothing spline of the time variable in my model.

I can fit a GAM with the gam package:

library(gam)
fit = gam(conversion ~ log(N) + log(age) + s(cohort, df = 4), data = df, family = poisson)
fit
> Call:
> gam(formula = conversion ~ log(N) + log(age) + s(cohort, df = 4), 
> family = poisson, data = df)

> Degrees of Freedom: 1559 total; 1553 Residual
> Residual Deviance: 1869.943 
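
Written out, with notation that is only an assumption for illustration (the symbols \beta_0, \beta_1, \beta_2 and f do not appear in the code), this model with R's default log link for the poisson family is roughly:

```latex
\log \mathbb{E}[\text{conversion}] = \beta_0 + \beta_1 \log N + \beta_2 \log(\text{age}) + f(\text{cohort})
```

Exponentiating gives \mathbb{E}[\text{conversion}] = e^{\beta_0} N^{\beta_1} \text{age}^{\beta_2} e^{f(\text{cohort})}, so the log(age) term carries the power-law decay in age, and f is the smoothing spline in cohort, fitted here with 4 degrees of freedom.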

But if I try to train the model using the caret package:

library(caret)
fitControl = trainControl(verboseIter = TRUE)
fit.crt = train(conversion ~ log(N) + log(age) + s(cohort,df),
            data = df, method = "gamSpline",
            trControl = fitControl, tune.length = 3, family = poisson)

I get this error:

+ Resample01: df=1 
model fit failed for Resample01: df=1 Error in as.matrix(x) : object 'N' not found

- Resample01: df=1 
+ Resample01: df=2 
model fit failed for Resample01: df=2 Error in as.matrix(x) : object 'N' not found  .....

Does anyone know what I'm doing wrong here?

Thanks

There are two things wrong with your code.

  1. The train function can be finicky depending on the method you use (as you have noticed). With method = "gamSpline", train wraps every predictor term in the formula in a smooth. Your terms therefore become s(log(N), df), s(log(age), df) and s(s(cohort, df), df). The last one, s(s(cohort, df), df), does not make sense, so you must change s(cohort, df) to plain cohort.

  2. I am not sure why, but train with method = "gamSpline" does not accept function calls (e.g. log) inside the formula. I think this is because the method already applies s() to your variables. You can work around it by applying the log before calling train, for example df$N <- log(df$N), or df$logN <- log(df$N) and then using logN as the predictor. Do the same for age.
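
Putting both points together, a minimal sketch of the workaround might look like the following. The names dat, logN and logAge, the seed, and the 5-fold cross-validation setting are my own choices for illustration and are not part of the question. Note also that caret spells the argument tuneLength; tune.length is not an argument of train.

```r
library(caret)  # train(), trainControl()
library(gam)    # backend used by method = "gamSpline"; attaching it makes s() visible

set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Rebuild the question's simulated data under a new name (dat),
# which avoids shadowing the stats function df()
cohort <- 1:60
age <- 1:26
grid <- expand.grid(age = age, cohort = cohort)
size <- data.frame(cohort = cohort,
                   N = sample(100:150, length(cohort), replace = TRUE))
dat <- merge(grid, size, by = "cohort")
log_k <- -3 + log(dat$N) - 0.5 * log(dat$age) +
  dat$cohort * (dat$cohort - 30) * (dat$cohort - 50) / 20000 +
  runif(nrow(dat), min = 0, max = 0.5)
dat$conversion <- rpois(nrow(dat), exp(log_k))

# Point 2: apply log() before calling train()
dat$logN <- log(dat$N)
dat$logAge <- log(dat$age)

# Point 1: pass plain variable names; the method adds the smooths itself
fitControl <- trainControl(method = "cv", number = 5)
fit_crt <- train(conversion ~ logN + logAge + cohort,
                 data = dat, method = "gamSpline",
                 trControl = fitControl, tuneLength = 3,
                 family = poisson)

fit_crt$bestTune  # the df value selected by resampling
```

As described in point 1, this smooths logN and logAge as well as cohort, all with the same df, so it is not the same model as the original gam() call.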

Judging by the code you provided, my guess is that you do not want this method to apply a smoothing term to all of your independent variables. I am not sure whether that can be avoided or, if it can, how to do it.

Hope this helps.

EDIT: If you want a more elegant solution than the workaround in point 2, make sure to read the comment by @topepo. If I understand it correctly, that suggestion also lets you apply s() only to the variables you choose.
