Using CARET together with GAM (“gamSpline” method) in R Poisson Regression

I am trying to use the caret package to tune the 'df' parameter of a GAM for my cohort analysis.

I have the following data:

cohort = 1:60
age = 1:26
grid = data.frame(expand.grid(age = age, cohort = cohort))
size = data.frame(cohort = cohort, N = sample(100:150,length(cohort), replace = TRUE))
df = merge(grid, size, by = "cohort")

log_k = -3 + log(df$N) - 0.5*log(df$age) + df$cohort*(df$cohort-30)*(df$cohort-50)/20000 + runif(nrow(df),min = 0, max = 0.5)
df$conversion = rpois(nrow(df),exp(log_k))

Explanation of the data: the cohort number is the time of arrival of the potential customers. N is the number of potential customers who arrived at that time. Conversion is the number of those potential customers who 'converted' (bought something). Age is the age (time elapsed since arrival) of the cohort when the conversion took place. For a given cohort there are fewer conversions as age grows, and this effect follows a power law. But the overall conversion rate of each cohort can also change slowly over time (cohort number), so I want a smoothing spline of the time variable in my model.
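For reference, the simulation code above can be written out as a model. This is only a restatement of the two lines that build log_k and conversion; the symbols c (cohort), a (age), N_c (cohort size) and u (the uniform noise term) are notation introduced here, not part of the original post.

```latex
\begin{aligned}
\log k_{c,a} &= -3 + \log N_c - 0.5\,\log a + \frac{c\,(c-30)\,(c-50)}{20000} + u_{c,a},
\qquad u_{c,a} \sim \mathrm{Uniform}(0,\,0.5),\\
\text{conversion}_{c,a} &\sim \mathrm{Poisson}\!\left(k_{c,a}\right).
\end{aligned}
```

The term -0.5 log a is the power law in age (k is proportional to a^{-0.5}), and the cubic in c is the slowly varying cohort effect that the spline is meant to capture.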

I can fit a GAM using the gam package:

library(gam)
fit = gam(conversion ~ log(N) + log(age) + s(cohort, df = 4), data = df, family = poisson)
fit
> Call:
> gam(formula = conversion ~ log(N) + log(age) + s(cohort, df = 4), 
> family = poisson, data = df)

> Degrees of Freedom: 1559 total; 1553 Residual
> Residual Deviance: 1869.943 

But if I try to train the model using the caret package:

library(caret)
fitControl = trainControl(verboseIter = TRUE)
fit.crt = train(conversion ~ log(N) + log(age) + s(cohort,df),
            data = df, method = "gamSpline",
            trControl = fitControl, tune.length = 3, family = poisson)

I get this error:

+ Resample01: df=1 
model fit failed for Resample01: df=1 Error in as.matrix(x) : object 'N' not found

- Resample01: df=1 
+ Resample01: df=2 
model fit failed for Resample01: df=2 Error in as.matrix(x) : object 'N' not found  .....

Does anyone know what I'm doing wrong here?

Thanks.

There are two things wrong with your code.

  1. The train function can be a bit tedious depending on the method you use (as you have noticed). In the case of method = "gamSpline", the train function adds a smooth term to every independent term in the formula. So it converts your terms to s(log(N), df), s(log(age), df) and s(s(cohort, df), df). But s(s(cohort, df), df) does not make sense, so you must change s(cohort, df) to cohort.

  2. I am not sure why, but train with method = "gamSpline" does not like functions (e.g. log) in the formula. I think this is because the method already applies s() to your variables. You can work around it by applying the log to your variables beforehand, for example df$N <- log(df$N), or logN <- log(df$N) and then using logN as the variable. Do the same for age.
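The two changes above can be sketched together as follows, reusing the simulated data from the question. This is a minimal sketch, not a verified fix: the column names logN and logAge are illustrative, set.seed is added only for reproducibility, and tuneLength is used in place of the question's tune.length because tuneLength is the argument name documented for train. As described in point 1, this smooths all three predictors with the same df.

```r
library(caret)  # method = "gamSpline" also requires the gam package to be installed

set.seed(1)  # assumption: added only so the sketch is reproducible

# Simulated data, as in the question
cohort <- 1:60
age <- 1:26
grid <- expand.grid(age = age, cohort = cohort)
size <- data.frame(cohort = cohort,
                   N = sample(100:150, length(cohort), replace = TRUE))
df <- merge(grid, size, by = "cohort")
log_k <- -3 + log(df$N) - 0.5 * log(df$age) +
  df$cohort * (df$cohort - 30) * (df$cohort - 50) / 20000 +
  runif(nrow(df), min = 0, max = 0.5)
df$conversion <- rpois(nrow(df), exp(log_k))

# Point 2: apply the transformations before calling train()
# (logN and logAge are illustrative column names)
df$logN <- log(df$N)
df$logAge <- log(df$age)

# Point 1: plain variable names only; no s() and no log() in the formula
fitControl <- trainControl(verboseIter = TRUE)
fit.crt <- train(conversion ~ logN + logAge + cohort,
                 data = df, method = "gamSpline",
                 trControl = fitControl, tuneLength = 3,
                 family = poisson)  # passed through to the underlying gam fit

print(fit.crt$bestTune)  # the df value selected by resampling
```

Printing fit.crt afterwards shows the resampled performance for each candidate df, which is the quickest way to check that no resample failed.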

My guess, based on the code you provided, is that you don't want this method to apply a smoothing term to all your independent variables. I am not sure whether that is possible or, if it is, how to do it.

Hope this helps.

EDIT: If you want a more elegant solution than the one I gave in point 2, make sure to read the comment by @topepo. If I understand it correctly, that suggestion also lets you apply s() only to the variables you choose.
