How to create a for loop to go through multiple year combinations for a glm in R?

Question

I have a large dataset with the presence and absence (0,1) of Blue Rockfish and multiple variables (in my case, bathymetry, curvature, eastness, fine scale BPI, and broad scale BPI).

structure(list(Pres_Abs = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L), CommonName = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "Blue Rockfish", class = "factor"), Survey_Yea = c(2009L, 
2014L, 2005L, 2015L, 2006L, 2009L, 2014L, 2015L, 2015L, 2015L, 
2005L, 2014L, 2015L, 2015L, 2014L, 2015L, 2015L, 2015L, 2015L, 
2006L), ca_10mbath = c(-42.6913986, -36.0038986, -36.5155983, 
-44.4014816, -39.3320007, -45.7226982, -47.9375, -51.5976982, 
-37.7812996, -14.1093302, -70.5976562, -41.5121307, -48.4246902, 
-46.0937996, -38.9961014, -46.375, -42.6913986, -60.96875, -46.375, 
-37.6601601), ca10_bpi24 = c(-12L, -2L, -2L, -2L, -2L, -2L, 7L, 
37L, -2L, 77L, -2L, -2L, 17L, 7L, -2L, -2L, -12L, -2L, -2L, 67L
), ca_10m_cur = c(-0.0859528, -0.0006409, -0.0068855, -0.5154228, 
-0.0390663, -0.0078316, -0.0221901, 0.792961, 0, 4.3303394, 0.0429688, 
-0.4405556, -0.1947556, 0.0195274, -0.230453, -0.0093803, -0.0859528, 
-0.2148438, -0.0093803, 0.0976486), ca_10m_eas = c(0.727106, 
0.887252, 0.565906, 0.9994883, 0.96552, 0.960033, 0.998732, 0.772206, 
0.589553, -0.4134142, -0.8266082, -0.3659272, -0.7330094, 0.0329623, 
0.998884, 0.271237, 0.727106, -0.5498384, 0.271237, 0.6424425
), ca10_bpi30 = c(-15L, -15L, -15L, -15L, -15L, -15L, -15L, -15L, 
-15L, 262L, -15L, -15L, -15L, -15L, -15L, -15L, -15L, -15L, -15L, 
-15L)), row.names = c(2032L, 3801L, 479L, 4421L, 997L, 1551L, 
3079L, 4657L, 5059L, 4104L, 261L, 2849L, 4460L, 4765L, 3535L, 
4842L, 4950L, 4323L, 4833L, 752L), class = "data.frame")

In addition, I have multiple years of data (2005, 2006, 2007, 2009, 2014, 2015). I am basically wanting to create a glm

Pres_Abs~bathy+curvature+eastness+broadscale+finescale, data=Blue_allyears, family=binomial(link=logit))

that goes through every combination of years. So, on the 1-year level, I created glms using data from 2005, then data from 2006, then data from 2007, etc. Within that code, I am saving data such as AIC, residual and null deviance, chi square, p value Etc.

This was my code (adapted from someone else on stackoverflow) that I used to loop through the first years of data:

results <- data.frame()


for(Survey_Yea in unique(Blue_allyears$Survey_Yea)){
  
  # dynamically generate formula
        fmla <- as.formula(Pres_Abs~ca_10mbath+ca_10m_cur+ca_10m_eas+ca10_bpi30+ca10_bpi24)

        # fit glm model
        fit<-glm(fmla,data=Blue_allyears[Blue_allyears$Survey_Yea == Survey_Yea,],family=binomial(link=logit))

        ## capture summary stats
        
        AIC <- AIC(fit)
        Deviance <- deviance(fit)
        NullDeviance <- fit$null.deviance
        null_minus_dev<-NullDeviance-Deviance
        df.residual<- fit$df.residual      
        df.null<-fit$df.null 
        df.null.minus.df.residual<-df.null-df.residual
       pvalue<- with(fit, 1-pchisq(null_minus_dev ,  df.null.minus.df.residual))
       Years<-"1"
   

        # get coefficents of fit
        cfit <- coef(summary(fit))

        # create temporary data frame
        df <- data.frame( Survey_Yea = Survey_Yea, 
                         AIC = AIC(fit), Deviance = deviance(fit),NullDeviance = fit$null.deviance, null.minus.dev=NullDeviance-Deviance, df.residual= fit$df.residual ,   df.null=fit$df.null , df.null.minus.df.residual=df.null-df.residual,  pvalue= pvalue,  Years="1", stringsAsFactors = F)

        # bind rows of temporary data frame to the results data frame
        results <- rbind(results, df)
}


results

This code was great and created glms from each year of data.

structure(list(Survey_Yea = c(2005L, 2006L, 2007L, 2009L, 2014L, 
2015L), AIC = c(731.84838805646, 480.699964265887, 113.681123536743, 
764.359566454308, 1482.05275641814, 1581.2853892652), Deviance = c(719.84838805646, 
468.699964265887, 101.681123536743, 752.359566454308, 1470.05275641814, 
1569.2853892652), NullDeviance = c(987.041585117362, 690.374591837705, 
174.673089501106, 1059.1288918956, 2412.15218834861, 2012.89941234608
), null.minus.dev = c(267.193197060902, 221.674627571818, 72.991965964363, 
306.769325441288, 942.099431930472, 443.614023080884), df.residual = c(706L, 
492L, 120L, 758L, 1734L, 1446L), df.null = c(711L, 497L, 125L, 
763L, 1739L, 1451L), df.null.minus.df.residual = c(5L, 5L, 5L, 
5L, 5L, 5L), pvalue = c(0, 0, 2.44249065417534e-14, 0, 0, 0), 
    Years = c("1", "1", "1", "1", "1", "1")), row.names = c(NA, 
-6L), class = "data.frame")

Now, I would like to go through two years of data and create the glms and extract the associated data. So, for example the year iterations would be: 2005 and 2006 2005 and 2007 2005 and 2009 2005 and 2014 2005 and 2015 2006 and 2007 2006 and 2009 etc.... 2014 and 2015

After doing this with two years of data, I'd like to go through every combination with three years of data, etc. until I get to using all years of data.

I have played around adding another for loop or adding in combn() to my code but with no luck.

Any help would be much appreciated!

Also, this is my first time posting so let me know if you need more data. Thanks!

Answer 1

Consider encapsulating all your processing in a defined method where you receive the combination vector of years and number of years as parameters. Then, iterate with lapply + combn .

Function

run_model <- function(vec, yr) {
  # subset data by years
  sub <- Blue_allyears[Blue_allyears$Survey_Yea %in% vec,]
  
  # dynamically generate formula
  fmla <- Pres_Abs ~ ca_10mbath+ca_10m_cur+ca_10m_eas+ca10_bpi30+ca10_bpi24
  
  # fit glm model
  fit <- glm(fmla, data=sub, family=binomial(link=logit))
  
  ## capture summary stats
  AIC <- AIC(fit)
  Deviance <- deviance(fit)
  NullDeviance <- fit$null.deviance
  null_minus_dev <- NullDeviance - Deviance
  df.residual <- fit$df.residual      
  df.null <- fit$df.null 
  df.null.minus.df.residual <- df.null - df.residual
  pvalue <- 1 - pchisq(null_minus_dev,  df.null.minus.df.residual)
  
  # get coefficents of fit
  cfit <- coef(summary(fit))
  
  # create temporary data frame
  df <- data.frame(
    Survey_Yea = paste(vec, collapse=", "), 
    AIC = AIC,
    Deviance = Deviance,
    NullDeviance = NullDeviance, 
    null.minus.dev = null_minus_dev,
    df.residual = df.residual,   
    df.null = df.null, 
    df.null.minus.df.residual = df.null.minus.df.residual,
    pvalue = pvalue, 
    Years = yr, 
    stringsAsFactors = FALSE   # DEFAULT IN R 1.4.0+
  )
  
  return(df)
}

Call

years <- sort(unique(Blue_allyears$Survey_Yea))

# RETURN NESTED LIST OF MANY DATA FRAMES
results_df_list <- lapply(1:3, function(i) combn(
  years, i, run_model, simplify=FALSE, num_yr=i)
)

# RETURN FLATTENED LIST OF THREE DATA FRAMES AND
# RENAME ELEMENTS
results_df_list <- setNames(
  lapply(results_df_list, function(dfs) do.call(rbind, dfs)),
  c("years_1", "years_2", "years_3")
)

# REVIEW EMBEDDED DATA FRAMES
View(results_df_list$years_1)
View(results_df_list$years_2)
View(results_df_list$years_3)

Demo

To demonstrate with random data matching structure of OP's screenshot image:

set.seed(52222)
Blue_allyears <- data.frame(
  Survey_Yea = sample(2005:2014, 500, replace=TRUE),
  Pres_Abs = sample(0:1, 500, replace=TRUE),
  ca_10mbath = runif(500),
  ca_10m_cur = runif(500),
  ca_10m_eas = runif(500),
  ca10_bpi30 = runif(500),
  ca10_bpi24 = runif(500)
)

#...run above blocks...

head(results_df_list$years_1)
#   Survey_Yea      AIC Deviance NullDeviance null.minus.dev df.residual df.null df.null.minus.df.residual     pvalue Years
# 1       2005 83.68461 71.68461     81.77442      10.089809          53      58                         5 0.07273019     1
# 2       2006 68.09388 56.09388     60.28383       4.189951          41      46                         5 0.52240456     1
# 3       2007 69.25363 57.25363     62.18310       4.929472          39      44                         5 0.42454811     1
# 4       2008 79.01764 67.01764     70.52444       3.506803          45      50                         5 0.62235846     1
# 5       2009 81.57290 69.57290     74.19185       4.618955          48      53                         5 0.46412711     1
# 6       2010 85.46602 73.46602     76.88259       3.416573          51      56                         5 0.63604708     1

head(results_df_list$years_2)
#   Survey_Yea      AIC Deviance NullDeviance null.minus.dev df.residual df.null df.null.minus.df.residual    pvalue Years
# 1 2005, 2006 152.5382 140.5382     145.0927       4.554509         100     105                         5 0.4726236     2
# 2 2005, 2007 153.2814 141.2814     144.0207       2.739315          98     103                         5 0.7400991     2
# 3 2005, 2008 159.2930 147.2930     152.3469       5.053860         104     109                         5 0.4093425     2
# 4 2005, 2009 160.5739 148.5739     156.2174       7.643473         107     112                         5 0.1770101     2
# 5 2005, 2010 167.3905 155.3905     159.5665       4.176056         110     115                         5 0.5243568     2
# 6 2005, 2011 153.0582 141.0582     145.5514       4.493158          99     104                         5 0.4807993     2

head(results_df_list$years_3)
#         Survey_Yea      AIC Deviance NullDeviance null.minus.dev df.residual df.null df.null.minus.df.residual    pvalue Years
# 1 2005, 2006, 2007 219.1731 207.1731     208.5284       1.355302         145     150                         5 0.9291396     3
# 2 2005, 2006, 2008 225.7515 213.7515     216.8769       3.125365         151     156                         5 0.6806653     3
# 3 2005, 2006, 2009 228.9630 216.9630     221.4069       4.443965         154     159                         5 0.4874155     3
# 4 2005, 2006, 2010 235.7721 223.7721     225.9108       2.138620         157     162                         5 0.8296509     3
# 5 2005, 2006, 2011 218.5088 206.5088     209.4254       2.916605         146     151                         5 0.7128412     3
# 6 2005, 2006, 2012 213.4275 201.4275     210.2102       8.782750         147     152                         5 0.1180497     3

How to create a for loop to go through multiple year combinations for a glm in R?

Question

1 answers

solution1
1 ACCPTED 2022-05-23 00:57:12

How to create a for loop to go through multiple year combinations for a glm in R?

Question

1 answers

solution1 1 ACCPTED 2022-05-23 00:57:12

solution1
1 ACCPTED 2022-05-23 00:57:12