Variable Selection with mgcv

Is there a way of automating variable selection for a GAM in R, similar to step()? I've read the documentation of step.gam and selection.gam, but I've yet to see an answer with code that works. Additionally, I've tried method = "REML" and select = TRUE, but neither removes insignificant variables from the model.

I've theorized that I could create a step model and then use those variables to create the GAM, but that does not seem computationally efficient.

Example:

library(mgcv)

set.seed(0)
dat <- data.frame(rsp = rnorm(100, 0, 1), 
                  pred1 = rnorm(100, 10, 1), 
                  pred2 = rnorm(100, 0, 1), 
                  pred3 = rnorm(100, 0, 1), 
                  pred4 = rnorm(100, 0, 1))

model <- gam(rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4),
             data = dat, method = "REML", select = TRUE)

summary(model)

#Family: gaussian 
#Link function: identity 

#Formula:
#rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4)

#Parametric coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)  0.02267    0.08426   0.269    0.788

#Approximate significance of smooth terms:
#            edf Ref.df     F p-value  
#s(pred1) 0.8770      9 0.212  0.1174  
#s(pred2) 1.8613      9 0.638  0.0374 *
#s(pred3) 0.5439      9 0.133  0.1406  
#s(pred4) 0.4504      9 0.091  0.1775  
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#R-sq.(adj) =  0.0887   Deviance explained = 12.3%
#-REML = 129.06  Scale est. = 0.70996   n = 100

Marra and Wood (2011, Computational Statistics and Data Analysis 55: 2372-2387) compare various approaches for feature selection in GAMs. They concluded that an additional penalty term in the smoothness selection procedure gave the best results. This can be activated in mgcv::gam() by using the select = TRUE argument, or by using a shrinkage basis such as bs = "ts". Of the variations below, the default thin plate basis ("tp") fitted without select = TRUE leaves the linear part of each smooth unpenalized, and "cc" is a cyclic basis intended for periodic covariates:

model <- gam(rsp ~ s(pred1, bs = "ts") + s(pred2, bs = "ts") + s(pred3, bs = "ts") + s(pred4, bs = "ts"),
             data = dat, method = "REML")
model <- gam(rsp ~ s(pred1, bs = "cr") + s(pred2, bs = "cr") + s(pred3, bs = "cr") + s(pred4, bs = "cr"),
             data = dat, method = "REML", select = TRUE)
model <- gam(rsp ~ s(pred1, bs = "cc") + s(pred2, bs = "cc") + s(pred3, bs = "cc") + s(pred4, bs = "cc"),
             data = dat, method = "REML")
model <- gam(rsp ~ s(pred1, bs = "tp") + s(pred2, bs = "tp") + s(pred3, bs = "tp") + s(pred4, bs = "tp"),
             data = dat, method = "REML")
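
Note that the extra penalty shrinks terms rather than deleting them from the formula, so "removal" has to be read off the effective degrees of freedom. As a rough sketch (not part of the original answer), one could flag terms whose edf falls below a cutoff and refit without them. The name edf_cutoff and its value 0.1 are hypothetical choices for illustration, not mgcv settings, and this drop-and-refit step is only a heuristic.

```r
library(mgcv)

# Same simulated data as in the question: rsp is unrelated to every predictor.
set.seed(0)
dat <- data.frame(rsp = rnorm(100, 0, 1),
                  pred1 = rnorm(100, 10, 1),
                  pred2 = rnorm(100, 0, 1),
                  pred3 = rnorm(100, 0, 1),
                  pred4 = rnorm(100, 0, 1))

model <- gam(rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4),
             data = dat, method = "REML", select = TRUE)

# Table of smooth terms: one row per term, with columns edf, Ref.df, F, p-value.
s_tab <- summary(model)$s.table

# ASSUMPTION: treat a term as "penalized out" when its edf is below this
# hypothetical cutoff; pick a value that suits your own problem.
edf_cutoff <- 0.1
shrunk_out <- rownames(s_tab)[s_tab[, "edf"] < edf_cutoff]
kept <- setdiff(rownames(s_tab), shrunk_out)

# Refit with only the retained terms (intercept-only if nothing is kept).
rhs <- if (length(kept) > 0) kept else "1"
refit <- gam(reformulate(rhs, response = "rsp"), data = dat, method = "REML")

shrunk_out
formula(refit)
```

If no term falls below the cutoff, shrunk_out is empty and the refit simply reproduces the full set of smooths without the extra penalty.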

In addition to specifying select = TRUE in your call to gam(), you can increase the value of the gamma argument to get stronger penalization. For example, we generate some data:

library("mgcv")
set.seed(2) 
dat <- gamSim(1, n=400, dist="normal", scale=5)
## Gu & Wahba 4 term additive model

We fit a GAM with 'standard' penalization and no additional selection penalty:

b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data=dat, method = "REML")
summary(b)
##
## Family: gaussian 
## Link function: identity 
##
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.890      0.246   32.07   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Approximate significance of smooth terms:
##         edf Ref.df      F  p-value    
## s(x0) 1.363  1.640  0.804   0.3174    
## s(x1) 1.681  2.088 11.309 1.35e-05 ***
## s(x2) 5.931  7.086 16.240  < 2e-16 ***
## s(x3) 1.002  1.004  4.102   0.0435 *  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## R-sq.(adj) =  0.253   Deviance explained = 27.1%
## -REML = 1212.5  Scale est. = 24.206    n = 400
par(mfrow = c(2, 2)) 
plot(b)


We fit a GAM with stronger penalization and variable selection:

b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data=dat, method = "REML", select = TRUE, gamma = 7)
summary(b2)

## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.8898     0.2604    30.3   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Approximate significance of smooth terms:
##             edf Ref.df     F p-value    
## s(x0) 5.330e-05      9 0.000  0.1868    
## s(x1) 5.427e-01      9 0.967 7.4e-05 ***
## s(x2) 1.549e+00      9 6.210 < 2e-16 ***
## s(x3) 6.155e-05      9 0.000  0.0812 .  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## R-sq.(adj) =  0.163   Deviance explained = 16.7%
## -REML = 179.46  Scale est. = 27.115    n = 400
plot(b2)


According to the documentation, increasing the value of gamma produces smoother models, because it multiplies the effective degrees of freedom in the GCV or UBRE/AIC criterion. The documentation also describes n/gamma as an effective sample size, which is what allows gamma to be used with REML, as in the fits above.
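
To make the role of gamma concrete, here is the GCV criterion as I understand it from the mgcv documentation, written with D for the deviance, n for the number of observations, and tau for the total effective degrees of freedom (the symbol names are my own choice):

```latex
\mathcal{V}_g(\gamma) = \frac{n\,D}{\left(n - \gamma\,\tau\right)^{2}}, \qquad \gamma \ge 1
```

With gamma = 1 this is the ordinary GCV score. For gamma > 1, each effective degree of freedom shrinks the denominator more, so wigglier fits are charged a higher price and smoother models are preferred.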

A possible downside is thus that all non-linear effects will be shrunk towards linear effects, and all linear effects will be shrunk towards zero. This is also what we observe in the plots and output above: with a higher value of gamma, some effects are practically penalized out (edf values close to 0, F-value of 0), while the other effects are closer to linear (edf values closer to 1).
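
To see how the shrinkage develops, a small sketch along these lines tabulates the edf of each smooth over a grid of gamma values. The grid c(1, 2, 4, 7) is an arbitrary choice for illustration; only gamma = 7 comes from the example above.

```r
library(mgcv)

# Same simulated data as above (Gu & Wahba 4 term additive model).
set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 5)

# ASSUMPTION: an illustrative grid of gamma values; adjust as needed.
gammas <- c(1, 2, 4, 7)

# Fit one model per gamma and collect the edf of each smooth term.
edf_by_gamma <- sapply(gammas, function(g) {
  fit <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
             data = dat, method = "REML", select = TRUE, gamma = g)
  summary(fit)$s.table[, "edf"]
})
colnames(edf_by_gamma) <- paste0("gamma=", gammas)

# Rows are smooth terms, columns are gamma values.
round(edf_by_gamma, 3)
```

Reading across a row shows at which gamma a given term is effectively penalized out, which can help in choosing a value that is less aggressive than 7.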
