
Model selection for multiple binomial GAM (mgcv) and/or multiple logistic regression

I am trying to model a logistic response based on 4 continuous variables from an experiment I conducted. Originally I used multiple logistic regression and got fairly good results, but recently it was suggested that I should be using GAMs instead. I'm somewhat lost on how to properly do model selection for a GAM, and on how to interpret some of the warnings I'm getting from my multiple-regression GLMs. I suspect my problems stem from overfitting, but I don't know how to address them.

The basic question: what is the best/most parsimonious way to model these data?

The data frame:

df <- structure(list(response = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                            0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                            0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0),
               V1 = c(14.2,13.67, 13.05, 14.18, 13.4, 14.12, 14.22, 14.15, 13.35, 13.67, 
                      18.58, 18.27, 18.6, 17.94, 18.38, 18.98, 18.15, 19, 18.55, 18.53, 
                      20.77, 21.65, 21.03, 21.57, 21.25, 21.63, 21.6, 21.09, 21.62, 
                      21.6, 26.23, 26.52, 25.7, 26.57, 26.6, 26.25, 26.48, 26.26, 26.25, 
                      26.4, 28.98, 29.45, 29.2, 29.65, 29.38, 28.6, 28.42, 28.95, 28.85, 
                      28.8), V2 = c(27.2, 37.98, 24.63, 32.97, 30.27, 18.66, 13.77, 
                      33.99, 15.8, 21.32, 14.21, 15.81, 35.83, 21.64, 26.93, 38.62, 
                      34.03, 18.76, 24.12, 29.67, 29.83, 33.22, 27.11, 24.92, 21.72, 
                      39.02, 12.93, 18.44, 36.34, 15.81, 13.29, 21.04, 19.05, 33.62, 
                      30.52, 16.07, 28.43, 24.97, 39.9, 37.05, 19.31, 31.3, 34.08, 
                      13.63, 25.1, 28.93, 22.36, 34.69, 39.5, 16.41), 
               V3 = c(8.06, 7.87, 7.81, 7.72, 8.04, 7.66, 7.72, 7.87, 7.72, 7.98, 7.59, 7.9, 
                      8.08, 7.64, 8.02, 7.73, 7.77, 7.74, 7.66, 7.71, 8.05, 7.68, 7.63, 
                      7.7, 7.64, 7.8, 7.7, 7.98, 7.86, 7.68, 7.65, 7.74, 7.99, 7.75, 
                      7.91, 7.64, 7.69, 7.78, 7.69, 7.66, 7.72, 7.76, 7.71, 7.88, 7.63, 
                      7.7, 7.99, 7.82, 7.75, 7.93), 
               V4 = c(362.12, 645.38, 667.54, 
                      957.44, 391.84, 818.34, 732.91, 649.05, 722.02, 406.71, 918.9, 
                      471.32, 363.77, 926.82, 385.4, 1038.91, 850.67, 715.11, 964.79, 
                      890.11, 370.51, 1078.68, 1083.7, 893.76, 1026.1, 887.29, 737.68, 
                      406.76, 690.39, 872.8, 847.26, 738.53, 397.33, 895.3, 563.93, 
                      991.17, 957.28, 734.55, 1140.5, 1199.12, 817.17, 800.5, 992.82, 
                      533.45, 1123.29, 943.25, 411.59, 745.22, 929.75, 460.82)), 
          row.names = c(NA,-50L), class = "data.frame")

I should note that, from doing the experiment and knowing the system, I know that V1 and V2 have the strongest influence on the response. You can also see this by plotting the response against just those two variables: all the positive responses cluster in that 2-D space. Furthermore, looking at some ad-hoc splines, V1 appears to be linearly related to the response, V2 quadratically, V3 probably not at all, and V4 perhaps weakly quadratically.


One more important note: V3 and V4 are essentially two different measurements of the same thing, so they are highly correlated and will never be used together in any model.

So first I tried modeling all of this with multiple logistic regression. It was suggested that I test a whole slew of different models in my model selection, so I wrote them out in a list and ran them in a loop:

formulas <- list(# single predictors
                 response ~ V1,
                 response ~ V2,
                 response ~ V3,
                 response ~ V4,

                 # two predictors
                 response ~ V1 + V2,
                 response ~ V1 + V3,
                 response ~ V1 + V4,
                 response ~ V2 + V3,
                 response ~ V2 + V4,

                 # three predictors
                 response ~ V1 + V2 + V3,
                 response ~ V1 + V2 + V4,

                 # start quadratic models
                 response ~  V2 + I(V2^2) + V1 + I(V1^2),
                 response ~  V2 + I(V2^2) + V1 + I(V1^2) + V3,
                 response ~  V2 + I(V2^2) + V1 + I(V1^2) + V4,
                 response ~ V1 + V2 + I(V1^2),
                 response ~ V1 + V2 + I(V1^2) + V3,
                 response ~ V1 + V2 + I(V1^2) + V4,
                 response ~ V1 + I(V1^2),
                 response ~ V1 + V2 + I(V2^2),
                 response ~ V1 + V2 + I(V2^2) + V3,
                 response ~ V1 + V2 + I(V2^2) + V4,
                 response ~  V2 + I(V2^2),
                 # add interactions
                 response ~ V1 + V2 + V1*V2,
                 response ~ V1 + V2 + V1*V2 + V3,
                 response ~ V1 + V2 + V1*V2 + V4,
                 # quadratic with interaction
                 response ~ V1 + V2 + V1*V2 + V3 + I(V1^2),
                 response ~ V1 + V2 + V1*V2 + V3 + I(V2^2),
                 response ~ V1 + V2 + V1*V2 + V4 + I(V1^2),
                 response ~ V1 + V2 + V1*V2 + V4 + I(V2^2)

)

# run them all in a loop, then order by AIC
selection <- purrr::map_df(formulas, ~{
  mod <- glm(.x, data= df, family="binomial")
  data.frame(formula = format(.x), 
             AIC = round(AIC(mod),2), 
             BIC = round(BIC(mod),2),
             R2adj = round(DescTools::PseudoR2(mod,which=c("McFaddenAdj")),3)
  )
})

warnings()
# this returns a bunch of warnings about coercing the formulas into vectors, ignore those.
# however, this also lists the following for a handful of the models:
# "glm.fit: fitted probabilities numerically 0 or 1 occurred"
# which means perfect separation, but I'm not sure if this is a totally bad thing
# or not, as perfect separation actually pretty much does exist in the real data
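To see concretely what that warning means, here is a small self-contained sketch on a synthetic, perfectly separated dataset (not the df above): under separation the MLE diverges, the fitted probabilities get pinned at numerically 0 or 1, and the coefficient standard errors blow up, so the Wald p-values from summary() for the flagged models should not be trusted.

```r
# Toy example of perfect separation: x fully determines y at a threshold.
toy <- data.frame(x = 1:10, y = rep(c(0, 1), each = 5))  # x <= 5 => y = 0

# glm() emits "fitted probabilities numerically 0 or 1 occurred" here.
sep_mod <- suppressWarnings(glm(y ~ x, data = toy, family = binomial))

# Diverged fit: probabilities pinned near 0/1, enormous standard errors,
# so z-tests and p-values from summary() are meaningless for this model.
range(fitted(sep_mod))
max(summary(sep_mod)$coefficients[, "Std. Error"])
```

If separation genuinely reflects the data-generating process, a penalized fit such as Firth's bias-reduced logistic regression (e.g. via the brglm2 package) keeps the estimates finite while acknowledging the separation.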


# then sort by AIC (lower is better, so the winning model comes first):
library(dplyr)
selection %>% arrange(AIC)


So using that technique, we find that the two best models are response ~ V1 + V2 + I(V2^2) + V4 and response ~ V1 + V2 + I(V2^2). But when running them individually we get the "fitted probabilities numerically 0 or 1" warning, and we see that the only difference between them (the added V4) is not itself statistically significant in the better model. So... which one do we use?

bestmod1 <- glm(response ~ V1 + V2 + I(V2^2) + V4,
                family="binomial",
                data=df)
summary(bestmod1)$coefficients

bestmod2 <- glm(response ~ V1 + V2 + I(V2^2),
                family="binomial",
                data=df)
summary(bestmod2)$coefficients

Method 2: GAMs

Here I used a similar technique: list all the formulas and run them in a loop.

library(mgcv)
gam_formulas <- list( # ind. main effects,
  response ~ s(V1),
  response ~ s(V2),
  response ~ s(V3),
  response ~ s(V4),

  # two variables
  response ~ s(V1) + s(V2),
  response ~ s(V1) + s(V3),
  response ~ s(V1) + s(V4),
  response ~ s(V2) + s(V3),
  response ~ s(V2) + s(V4),

  # three variables
  response ~ s(V1) + s(V2) + s(V3),
  response ~ s(V1) + s(V2) + s(V4),

  # add interactions
  response ~ te(V1, V2),
  response ~ te(V1, V2) + s(V3),
  response ~ te(V1, V2) + s(V4),
  response ~ te(V1, V3),
  response ~ te(V1, V3) + s(V2),
  response ~ te(V1, V4),
  response ~ te(V1, V4) + s(V2),                  
  response ~ te(V2, V3),
  response ~ te(V2, V3) + s(V1),
  response ~ te(V2, V4),
  response ~ te(V2, V4) + s(V1), 
  response ~ te(V2, by=V1),
  response ~ te(V1, by=V2),
  response ~ te(V2, by=V3),

  # two interactions?
  response ~ te(V1, V3) + te(V1, V2),
  response ~ te(V1, V4) + te(V1, V2),
  response ~ te(V2, V3) + te(V1, V2),
  response ~ te(V2, V4) + te(V1, V2)
)

gam_selection <- purrr::map_df(gam_formulas, ~{
  gam <- gam(.x, 
             data= df,  # always use same df
             family="binomial",
             method="REML")  # always use REML smoothing method
  data.frame(cbind(formula = as.character(list(formula(.x))),
                   round(broom::glance(gam),2),
                   R2 = summary(gam)$r.sq
  ))
})

# similarly, this gives a bunch of warnings about coercing the formulas into characters, 
# but also this, for a handful of the models, which I am guessing is an overfitting thing:
#  In newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS,  ... :
#  Fitting terminated with step failure - check results carefully


gam_selection %>% arrange(desc(AIC))

But this returns a bunch of oddities, because many of the models (not even necessarily ones with similar formulas or AIC values) report R2 = 1.00, and they are formulas that make little biological sense. Why is this happening, and what should I do? (I know it has something to do with using "REML", because some of these errors go away without that line.) Based on the AIC values, I think the model that is actually the most accurate is the third from the bottom, response ~ te(V2, by = V1), which uses V2 as a smoothed variable and V1 as a linear one.
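Part of what can drive R2 = 1 is that with only 50 binary observations, a sufficiently wiggly smooth can effectively interpolate the data. One mitigation, sketched here on simulated data with a hypothetical predictor x (not the df above), is to cap the basis dimension of each smooth via the k argument of s():

```r
library(mgcv)
set.seed(1)

# Simulate 50 binary responses driven by one smooth predictor,
# mimicking the scale of the data set in the question.
n <- 50
x <- runif(n)
sim <- data.frame(x = x, y = rbinom(n, 1, plogis(4 * (x - 0.5))))

# Capping k limits how wiggly the smooth can get; with binary data and
# n = 50, a small basis (k = 4 or 5) is usually plenty.
fit <- gam(y ~ s(x, k = 4), data = sim, family = binomial, method = "REML")
summary(fit)$edf  # effective degrees of freedom cannot exceed k - 1
```

The edf reported by summary() then stays bounded well below n, which makes interpolation (and the resulting meaningless R2 = 1 fits) impossible by construction.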

Also, looking more closely at the top 2 GAMs according to AIC, none of the variables is significant on its own (p-values = 1... weird), which makes me think I shouldn't use those either.

bestgam <- gam(response ~ s(V1) + s(V2) + s(V4), 
               data= df,  # always use same df
               family="binomial",
               method="REML")
summary(bestgam)
bestgam2 <- gam(response ~ s(V1) + s(V2) + s(V3), 
               data= df,  # always use same df
               family="binomial",
               method="REML")
summary(bestgam2)
bestgam3 <- gam(response ~ te(V2, by = V1), 
                data= df,  # always use same df
                family="binomial",
                method="REML")
summary(bestgam3) # this is the one I think I should be using

Basically, I don't understand why I would use a GAM rather than a GLM or vice versa, nor how to select variables and avoid overfitting in the process. Any advice is appreciated.

Thanks!

You would use a GAM if you think there is a nonlinear relationship between the dependent and independent variables.

For model selection, you can add shrinkage to the smoothers in the model, so that a smoother can be penalized out of the model when it is not needed.

There are two ways to do this in the mgcv package:

  1. Change the basis type of any s() term you may want to shrink. For example, if you are using thin-plate regression splines, add bs = 'ts'; if you are using cubic regression splines, add bs = 'cs'.

  2. Specify select = TRUE in the call to gam().

More details on how this works, and on the difference between the two approaches, are in this answer: https://stats.stackexchange.com/questions/405129/model-selection-for-gam-in-r
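A minimal sketch of both approaches, using mgcv's built-in gamSim() simulator; its x3 covariate has no true effect on y, so with shrinkage the s(x3) term should end up with an effective degrees of freedom near zero:

```r
library(mgcv)
set.seed(2)

# gamSim(1) generates y from smooth functions of x0, x1, x2; x3 is pure noise.
dat <- gamSim(1, n = 200, verbose = FALSE)

# Method 1: shrinkage bases (bs = 'ts' for thin-plate regression splines)
b1 <- gam(y ~ s(x0, bs = "ts") + s(x1, bs = "ts") +
              s(x2, bs = "ts") + s(x3, bs = "ts"),
          data = dat, method = "REML")

# Method 2: an extra penalty on every smooth via select = TRUE
b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
          data = dat, method = "REML", select = TRUE)

# In both fits the edf of s(x3) (the 4th smooth) is shrunk towards zero,
# i.e. the term has effectively been selected out of the model.
summary(b1)$edf
summary(b2)$edf
```

This is exactly the situation in the question: terms like V3 that carry no signal are penalized away automatically, rather than being compared across a hand-written list of candidate formulas.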


The posts on this site follow the CC BY-SA 4.0 license.