
Model Selection for multiple binomial GAM (MGCV) and/or multiple logistic regression

I am trying to model a logistic response based on 4 continuous variables from an experiment I conducted. Originally, I used a multiple regression and achieved a pretty good result, but it was recently suggested to me that I should be using GAMs instead. I'm feeling a bit lost on how to properly do model selection for GAMs, and also how to interpret some warnings I am getting from my multiple regression GLM. I suspect my problems are coming from overfitting, but I don't know how to get around them.

Basically the question is: what is the best/most parsimonious way to model these data?

df:

df <- structure(list(response = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                            0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                            0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0),
               V1 = c(14.2,13.67, 13.05, 14.18, 13.4, 14.12, 14.22, 14.15, 13.35, 13.67, 
                      18.58, 18.27, 18.6, 17.94, 18.38, 18.98, 18.15, 19, 18.55, 18.53, 
                      20.77, 21.65, 21.03, 21.57, 21.25, 21.63, 21.6, 21.09, 21.62, 
                      21.6, 26.23, 26.52, 25.7, 26.57, 26.6, 26.25, 26.48, 26.26, 26.25, 
                      26.4, 28.98, 29.45, 29.2, 29.65, 29.38, 28.6, 28.42, 28.95, 28.85, 
                      28.8), V2 = c(27.2, 37.98, 24.63, 32.97, 30.27, 18.66, 13.77, 
                      33.99, 15.8, 21.32, 14.21, 15.81, 35.83, 21.64, 26.93, 38.62, 
                      34.03, 18.76, 24.12, 29.67, 29.83, 33.22, 27.11, 24.92, 21.72, 
                      39.02, 12.93, 18.44, 36.34, 15.81, 13.29, 21.04, 19.05, 33.62, 
                      30.52, 16.07, 28.43, 24.97, 39.9, 37.05, 19.31, 31.3, 34.08, 
                      13.63, 25.1, 28.93, 22.36, 34.69, 39.5, 16.41), 
               V3 = c(8.06, 7.87, 7.81, 7.72, 8.04, 7.66, 7.72, 7.87, 7.72, 7.98, 7.59, 7.9, 
                      8.08, 7.64, 8.02, 7.73, 7.77, 7.74, 7.66, 7.71, 8.05, 7.68, 7.63, 
                      7.7, 7.64, 7.8, 7.7, 7.98, 7.86, 7.68, 7.65, 7.74, 7.99, 7.75, 
                      7.91, 7.64, 7.69, 7.78, 7.69, 7.66, 7.72, 7.76, 7.71, 7.88, 7.63, 
                      7.7, 7.99, 7.82, 7.75, 7.93), 
               V4 = c(362.12, 645.38, 667.54, 
                      957.44, 391.84, 818.34, 732.91, 649.05, 722.02, 406.71, 918.9, 
                      471.32, 363.77, 926.82, 385.4, 1038.91, 850.67, 715.11, 964.79, 
                      890.11, 370.51, 1078.68, 1083.7, 893.76, 1026.1, 887.29, 737.68, 
                      406.76, 690.39, 872.8, 847.26, 738.53, 397.33, 895.3, 563.93, 
                      991.17, 957.28, 734.55, 1140.5, 1199.12, 817.17, 800.5, 992.82, 
                      533.45, 1123.29, 943.25, 411.59, 745.22, 929.75, 460.82)), 
          row.names = c(NA,-50L), class = "data.frame")

I should note that from doing the experiments and knowing the system, I know that V1 and V2 have the most influence on the response. You can also see this by plotting the response against just those two variables: all the positive responses cluster in that 2-D space. Also, from some makeshift spline fits, V1 appears linearly related to the response, V2 quadratically, V3 perhaps not at all, and V4 perhaps weakly quadratically.
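That clustering is easy to eyeball with a quick scatterplot (a sketch of my own, assuming ggplot2 is installed; `df` is the data frame above):

```r
# Plot the V1-V2 plane, colouring points by response, to see
# where the positive responses cluster
library(ggplot2)

ggplot(df, aes(x = V1, y = V2, colour = factor(response))) +
  geom_point(size = 2) +
  labs(colour = "response")
```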

[image: positive responses clustered in the V1-V2 plane]

[image: makeshift spline fits of the response against each predictor]

Another important note: V3 and V4 are basically two different measures of the same thing, so they are highly correlated and won't be used in any models together.
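The claimed collinearity can be verified directly with a one-liner (my addition, using the `df` above):

```r
# V3 and V4 are two measures of the same quantity, so their
# correlation should be strong; a large |r| justifies never
# including both in one model
cor(df$V3, df$V4)
```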

So first I tried modeling all of these with multiple logistic regression. I was advised to test a whole range of candidate models during model selection, so I wrote them in a list and ran them all in a loop:

formulas <- list(# single predictors
                 response ~ V1,
                 response ~ V2,
                 response ~ V3,
                 response ~ V4,

                 # two predictors
                 response ~ V1 + V2,
                 response ~ V1 + V3,
                 response ~ V1 + V4,
                 response ~ V2 + V3,
                 response ~ V2 + V4,

                 # three predictors
                 response ~ V1 + V2 + V3,
                 response ~ V1 + V2 + V4,

                 # start quadratic models
                 response ~  V2 + I(V2^2) + V1 + I(V1^2),
                 response ~  V2 + I(V2^2) + V1 + I(V1^2) + V3,
                 response ~  V2 + I(V2^2) + V1 + I(V1^2) + V4,
                 response ~ V1 + V2 + I(V1^2),
                 response ~ V1 + V2 + I(V1^2) + V3,
                 response ~ V1 + V2 + I(V1^2) + V4,
                 response ~ V1 + I(V1^2),
                 response ~ V1 + V2 + I(V2^2),
                 response ~ V1 + V2 + I(V2^2) + V3,
                 response ~ V1 + V2 + I(V2^2) + V4,
                 response ~  V2 + I(V2^2),
                 # add interactions
                 response ~ V1 + V2 + V1*V2,
                 response ~ V1 + V2 + V1*V2 + V3,
                 response ~ V1 + V2 + V1*V2 + V4,
                 # quadratic with interaction
                 response ~ V1 + V2 + V1*V2 + V3 + I(V1^2),
                 response ~ V1 + V2 + V1*V2 + V3 + I(V2^2),
                 response ~ V1 + V2 + V1*V2 + V4 + I(V1^2),
                 response ~ V1 + V2 + V1*V2 + V4 + I(V2^2)

)

# run them all in a loop, then order by AIC
selection <- purrr::map_df(formulas, ~{
  mod <- glm(.x, data = df, family = "binomial")
  data.frame(formula = deparse1(.x),  # deparse1() gives a single clean string per formula
             AIC = round(AIC(mod), 2),
             BIC = round(BIC(mod), 2),
             R2adj = round(DescTools::PseudoR2(mod, which = "McFaddenAdj"), 3)
  )
})

warnings()
# this returns some warnings about coercing the formulas into vectors; those
# can be ignored. However, a handful of the models also produce:
# "glm.fit: fitted probabilities numerically 0 or 1 occurred"
# which indicates (quasi-)complete separation. I'm not sure whether that is
# necessarily a bad thing here, since near-perfect separation genuinely
# exists in the real data.

# then arrange by AIC (lower is better) to find the winning model:
library(dplyr)
selection %>% arrange(AIC)


So using that technique, the two best models are response ~ V1 + V2 + I(V2^2) + V4 and response ~ V1 + V2 + I(V2^2). But when running them one at a time, we get the "fitted probabilities numerically 0 or 1" warning for both, and the only difference between them (the added V4) is not statistically significant in the larger model. So... which do we use?

bestmod1 <- glm(response ~ V1 + V2 + I(V2^2) + V4,
                family="binomial",
                data=df)
summary(bestmod1)$coefficients

bestmod2 <- glm(response ~ V1 + V2 + I(V2^2),
                family="binomial",
                data=df)
summary(bestmod2)$coefficients
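Since the two candidates are nested, a likelihood-ratio test is one standard way to decide whether V4 earns its keep; and because separation is suspected, a Firth-type bias-reduced fit can keep the coefficients finite. This sketch is my addition, not part of the original workflow, and assumes the brglm2 package is installed:

```r
# Does adding V4 significantly improve on the nested model?
anova(bestmod2, bestmod1, test = "Chisq")

# Under (quasi-)complete separation, maximum-likelihood estimates
# diverge; Firth's bias reduction (brglm2) keeps them finite
library(brglm2)
firth_fit <- glm(response ~ V1 + V2 + I(V2^2),
                 family = binomial, data = df,
                 method = "brglmFit")
summary(firth_fit)
```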

Method 2: GAMs

Similar technique here, list out all formulas and run in a loop.

library(mgcv)
gam_formulas <- list( # ind. main effects,
  response ~ s(V1),
  response ~ s(V2),
  response ~ s(V3),
  response ~ s(V4),

  # two variables
  response ~ s(V1) + s(V2),
  response ~ s(V1) + s(V3),
  response ~ s(V1) + s(V4),
  response ~ s(V2) + s(V3),
  response ~ s(V2) + s(V4),

  # three variables
  response ~ s(V1) + s(V2) + s(V3),
  response ~ s(V1) + s(V2) + s(V4),

  # add interactions
  response ~ te(V1, V2),
  response ~ te(V1, V2) + s(V3),
  response ~ te(V1, V2) + s(V4),
  response ~ te(V1, V3),
  response ~ te(V1, V3) + s(V2),
  response ~ te(V1, V4),
  response ~ te(V1, V4) + s(V2),                  
  response ~ te(V2, V3),
  response ~ te(V2, V3) + s(V1),
  response ~ te(V2, V4),
  response ~ te(V2, V4) + s(V1), 
  response ~ te(V2, by=V1),
  response ~ te(V1, by=V2),
  response ~ te(V2, by=V3),

  # two interactions?
  response ~ te(V1, V3) + te(V1, V2),
  response ~ te(V1, V4) + te(V1, V2),
  response ~ te(V2, V3) + te(V1, V2),
  response ~ te(V2, V4) + te(V1, V2)
)

gam_selection <- purrr::map_df(gam_formulas, ~{
  gam_mod <- gam(.x,
                 data = df,          # always use the same df
                 family = "binomial",
                 method = "REML")    # always use REML smoothness selection
  data.frame(formula = deparse1(.x),  # one clean string per formula
             round(broom::glance(gam_mod), 2),
             R2 = round(summary(gam_mod)$r.sq, 3))
})

# similarly, this gives a bunch of warnings about coercing the formulas into characters, 
# but also this, for a handful of the models, which I am guessing is an overfitting thing:
#  In newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS,  ... :
#  Fitting terminated with step failure - check results carefully


gam_selection %>% arrange(AIC)  # lower AIC is better

but this returns a bunch of wacky results: many of the models (not even necessarily ones with similar formulas or AIC values) report R2 = 1.00, and they are formulas that don't make much sense biologically. Why is this happening, and what do I do about it? (I know it has something to do with method = "REML", because some of these errors go away without that line.) The one I think is actually most accurate is third from the top by AIC: response ~ te(V2, by = V1), which smooths V2 while letting its effect scale linearly with V1.

Also, looking more carefully at the top two GAMs by AIC, none of the smooth terms is significant (the p-values are essentially 1, which is weird), which makes me feel I should not be using these.

bestgam <- gam(response ~ s(V1) + s(V2) + s(V4), 
               data= df,  # always use same df
               family="binomial",
               method="REML")
summary(bestgam)
bestgam2 <- gam(response ~ s(V1) + s(V2) + s(V3), 
               data= df,  # always use same df
               family="binomial",
               method="REML")
summary(bestgam2)
bestgam3 <- gam(response ~ te(V2, by = V1), 
                data= df,  # always use same df
                family="binomial",
                method="REML")
summary(bestgam3) # this is the one I think I should be using
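Whichever GAM I lean towards, mgcv's built-in diagnostics (my addition; not in the original post) help spot overfitting before trusting the AIC ranking:

```r
# Residual and basis-dimension checks; a k-index well below 1 or
# edf close to k' flags a problem with the basis dimension
gam.check(bestgam3)

# Visualise the fitted smooth(s); implausibly wiggly shapes with
# only n = 50 observations are another overfitting symptom
plot(bestgam3, pages = 1, shade = TRUE)
```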

Basically, I don't know why I would use a GAM over a GLM or vice versa, nor how to select variables and avoid overfitting in the process. Any advice appreciated.

Thanks!

You would use a GAM if you think there is a non-linear relationship between your dependent and independent variables.

For model selection, you can add shrinkage to the smoothers in the model so that they can be penalised out of the model if they're not needed.

There are two ways of doing this in the mgcv package:

  1. Change the basis type of any s() terms you might want to shrink, e.g. use bs = 'ts' if you're using thin-plate regression splines, or bs = 'cs' for cubic regression splines.

  2. Specify select = TRUE in your call to gam()

There's much more detail about how this works, and the difference between the two approaches in this answer: https://stats.stackexchange.com/questions/405129/model-selection-for-gam-in-r
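A minimal sketch of both approaches, reusing your df and the three-smooth model (which terms get shrunk out may of course vary with the data):

```r
library(mgcv)

# Approach 1: shrinkage bases -- thin-plate splines with shrinkage ('ts')
m_basis <- gam(response ~ s(V1, bs = "ts") + s(V2, bs = "ts") + s(V4, bs = "ts"),
               family = binomial, data = df, method = "REML")

# Approach 2: an extra penalty on each smooth's null space via select = TRUE
m_select <- gam(response ~ s(V1) + s(V2) + s(V4),
                family = binomial, data = df,
                method = "REML", select = TRUE)

# Terms that have been shrunk out of the model show edf near zero
summary(m_select)
```

Either way, smooths whose effective degrees of freedom collapse towards zero have effectively been selected out of the model, which replaces the manual all-subsets loop above.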
