Variable selection with mgcv

Question

Is there a way to automate variable selection for a GAM in R, similar to step? I've read the documentation for step.gam and gam.selection, but I've yet to see an answer with code that works. Additionally, I've tried method = "REML" and select = TRUE, but neither removes insignificant variables from the model.

I've theorized that I could fit a step model first and then use the variables it selects to build the GAM, but that does not seem computationally efficient.

Example:

library(mgcv)

set.seed(0)
dat <- data.frame(rsp = rnorm(100, 0, 1), 
                  pred1 = rnorm(100, 10, 1), 
                  pred2 = rnorm(100, 0, 1), 
                  pred3 = rnorm(100, 0, 1), 
                  pred4 = rnorm(100, 0, 1))

model <- gam(rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4),
             data = dat, method = "REML", select = TRUE)

summary(model)

#Family: gaussian 
#Link function: identity 

#Formula:
#rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4)

#Parametric coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)  0.02267    0.08426   0.269    0.788

#Approximate significance of smooth terms:
#            edf Ref.df     F p-value  
#s(pred1) 0.8770      9 0.212  0.1174  
#s(pred2) 1.8613      9 0.638  0.0374 *
#s(pred3) 0.5439      9 0.133  0.1406  
#s(pred4) 0.4504      9 0.091  0.1775  
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#R-sq.(adj) =  0.0887   Deviance explained = 12.3%
#-REML = 129.06  Scale est. = 0.70996   n = 100

Answer 1

Marra and Wood (2011, Computational Statistics and Data Analysis 55: 2372-2387) compare various approaches for feature selection in GAMs. They concluded that an additional penalty term in the smoothness selection procedure gave the best results. This can be activated in mgcv::gam() with the select = TRUE argument, or with any of the following variations:

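# Shrinkage version of the thin plate regression spline
model <- gam(rsp ~ s(pred1, bs = "ts") + s(pred2, bs = "ts") + s(pred3, bs = "ts") + s(pred4, bs = "ts"),
             data = dat, method = "REML")
# Cubic regression spline plus the extra selection penalty
model <- gam(rsp ~ s(pred1, bs = "cr") + s(pred2, bs = "cr") + s(pred3, bs = "cr") + s(pred4, bs = "cr"),
             data = dat, method = "REML", select = TRUE)
# Shrinkage version of the cubic regression spline ("cs"; "cc" is a cyclic spline and does no shrinkage)
model <- gam(rsp ~ s(pred1, bs = "cs") + s(pred2, bs = "cs") + s(pred3, bs = "cs") + s(pred4, bs = "cs"),
             data = dat, method = "REML")
# Thin plate regression spline (the default basis) plus the extra selection penalty
model <- gam(rsp ~ s(pred1, bs = "tp") + s(pred2, bs = "tp") + s(pred3, bs = "tp") + s(pred4, bs = "tp"),
             data = dat, method = "REML", select = TRUE)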
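
Neither select = TRUE nor a shrinkage basis deletes a term from the formula; a term that is selected out stays in the model with an effective degrees of freedom (edf) close to zero. Below is a minimal sketch of how one might flag such terms, reusing the question's simulated data. The 0.1 cutoff and the names edf_cutoff and shrunk_terms are illustrative assumptions, not mgcv conventions.

```r
library(mgcv)

# Same simulated data as in the question: the response is pure noise
set.seed(0)
dat <- data.frame(rsp = rnorm(100, 0, 1),
                  pred1 = rnorm(100, 10, 1),
                  pred2 = rnorm(100, 0, 1),
                  pred3 = rnorm(100, 0, 1),
                  pred4 = rnorm(100, 0, 1))

model <- gam(rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4),
             data = dat, method = "REML", select = TRUE)

# summary()$s.table has one row per smooth term, with an "edf" column
edf <- summary(model)$s.table[, "edf"]

# Assumed cutoff: treat terms with edf below this as "selected out"
edf_cutoff <- 0.1
shrunk_terms <- names(edf)[edf < edf_cutoff]

print(round(edf, 3))
print(shrunk_terms)
```

Whether to refit without the flagged terms is a judgment call; the penalized fit already gives them almost no influence.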

Answer 2

In addition to specifying select = TRUE in your call to gam(), you can increase the value of the gamma argument to get stronger penalization. For example, we generate some data:

library("mgcv")
set.seed(2) 
dat <- gamSim(1, n=400, dist="normal", scale=5)
## Gu & Wahba 4 term additive model
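
It may help to know what these simulated data contain before reading the fits. As far as I know, gamSim(1, ...) returns the true component functions alongside the covariates, and the component for x3 is zero, so x3 is the variable a selection method should drop. A quick check, assuming those column names (f0 to f3):

```r
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 5)

# Response y, covariates x0..x3, and the true component functions
print(names(dat))

# Check that the true contribution of x3 is zero for every observation
x3_is_noise <- all(dat$f3 == 0)
print(x3_is_noise)
```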

We first fit a GAM with 'standard' penalization, without the extra variable-selection penalty:

b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data=dat, method = "REML")
summary(b)
##
## Family: gaussian 
## Link function: identity 
##
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.890      0.246   32.07   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Approximate significance of smooth terms:
##         edf Ref.df      F  p-value    
## s(x0) 1.363  1.640  0.804   0.3174    
## s(x1) 1.681  2.088 11.309 1.35e-05 ***
## s(x2) 5.931  7.086 16.240  < 2e-16 ***
## s(x3) 1.002  1.004  4.102   0.0435 *  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## R-sq.(adj) =  0.253   Deviance explained = 27.1%
## -REML = 1212.5  Scale est. = 24.206    n = 400
par(mfrow = c(2, 2)) 
plot(b)

We fit a GAM with stronger penalization and variable selection:

b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data=dat, method = "REML", select = TRUE, gamma = 7)
summary(b2)

## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.8898     0.2604    30.3   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Approximate significance of smooth terms:
##             edf Ref.df     F p-value    
## s(x0) 5.330e-05      9 0.000  0.1868    
## s(x1) 5.427e-01      9 0.967 7.4e-05 ***
## s(x2) 1.549e+00      9 6.210 < 2e-16 ***
## s(x3) 6.155e-05      9 0.000  0.0812 .  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## R-sq.(adj) =  0.163   Deviance explained = 16.7%
## -REML = 179.46  Scale est. = 27.115    n = 400
plot(b2)

According to the documentation, increasing the value of gamma produces smoother models, because it multiplies the effective degrees of freedom in the GCV or UBRE/AIC criterion. For REML/ML fits such as the ones above, the same documentation describes n/gamma as an effective sample size.
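
As a sketch of what that multiplication means, the GCV score with the gamma multiplier can be written as below. The notation is mine, following the usual textbook presentation rather than the mgcv help page: D is the deviance (the residual sum of squares in the Gaussian case), n is the number of observations, and tau is the effective degrees of freedom of the fit for smoothing parameters lambda.

```latex
\mathcal{V}_g(\lambda) = \frac{n \, D(\hat{\beta}_\lambda)}{\left[\, n - \gamma \, \tau(\lambda) \,\right]^{2}}
```

With gamma > 1, each effective degree of freedom costs more in the denominator, so the score is minimized at larger smoothing parameters, that is, at smoother fits.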

A possible downside is thus that all non-linear effects are shrunk towards linear effects, and all linear effects are shrunk towards zero. This is also what we observe in the plots and output above: with the higher value of gamma, some effects are practically penalized out (edf values close to 0, F-value of 0), while the other effects are closer to linear (edf values closer to 1).
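
To see this shrinkage directly, one could tabulate the edf of each term across several gamma values. The sketch below does that for the same simulated data. The gamma grid is an arbitrary choice for illustration, and it assumes an mgcv version that honours gamma with method = "REML", as the output above suggests.

```r
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 5)

# Arbitrary illustrative grid; gamma = 1 is the default (no extra penalization)
gammas <- c(1, 1.4, 7)

# One column of per-term edf values for each gamma
edf_by_gamma <- sapply(gammas, function(g) {
  fit <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat,
             method = "REML", select = TRUE, gamma = g)
  summary(fit)$s.table[, "edf"]
})
colnames(edf_by_gamma) <- paste0("gamma=", gammas)

print(round(edf_by_gamma, 3))
```

Reading across a row shows how quickly a given term is pushed towards zero, which can help in picking a gamma that removes noise terms without flattening the real effects.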

Answer 1
3 2016-07-25 18:38:05

Answer 2
0 2023-01-19 16:18:26
