
How to test the statistical significance of a random effect in GAMM?

I am using the mgcv package to build a GAMM in R, and my questions are:

  • First, how can we know if the random effect is statistically significant or not?
  • Second, how can we extract the random intercept values in the model?
  • Third, what does the "offset" mean in gamm()? I have checked the R help, but I am still confused about the offset term in the function. Thanks for any help.

The example is taken from the book Generalized Additive Models: An Introduction with R.

library(mgcv)
library(gamair)

data(sole)
sole$off <- log(sole$a.1 - sole$a.0)  # log width of the egg age class (the offset)
sole$a <- (sole$a.1 + sole$a.0) / 2   # mid-point of the age class
solr <- sole
solr$t <- solr$t - mean(sole$t)       # centre and scale time
solr$t <- solr$t / var(sole$t)^0.5
solr$la <- solr$la - mean(sole$la)    # centre latitude and longitude
solr$lo <- solr$lo - mean(sole$lo)

# label each unique sampling station by its coordinates and time
solr$station <- factor(with(solr, paste(-la, -lo, -t, sep = "")))
som <- gamm(eggs ~ te(lo, la, t, bs = c("tp", "tp"), k = c(25, 5), d = c(2, 1)) +
                s(t, k = 5, by = a) + offset(off),
            family = quasipoisson, data = solr, random = list(station = ~1))

Note that for this model it might make more sense to use a Tweedie response via family tw() with gam() or bam(); that family can't be used with gamm(). In fact, Simon Wood and Matteo Fasiolo fit these data with a location-scale Tweedie GAM, in which the mean, variance, and power parameters of the Tweedie distribution are each modelled with a separate linear predictor.
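For reference, a minimal sketch of what such a Tweedie fit might look like with gam() (this is my adaptation of the model above, not the published location-scale model; tw() estimates the Tweedie power parameter during fitting, which is why it works with gam()/bam() but not with gamm()):

# hypothetical sketch: the quasi-Poisson model above refitted with a
# Tweedie response, reusing the same smooth specification and offset
som_tw <- gam(eggs ~ te(lo, la, t, bs = c("tp", "tp"), k = c(25, 5), d = c(2, 1)) +
                  s(t, k = 5, by = a) + offset(off),
              family = tw(), data = solr, method = "REML")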

At @BenBolker's suggestion: I wouldn't even bother testing the random effect in this model specifically, and often I don't care whether it is significant or not; it depends on the question or hypothesis I am working on. Often I want the term in the model because of some clustering in the data that I want accounted for regardless of its significance.

However, I'm not convinced that the theory of the (generalized) likelihood ratio test (GLRT) fails to apply to quasi-likelihood in this instance. Simon Wood presents derivations in Appendix A of the 2nd edition of his textbook on GAMs that show that the previously derived results for maximum likelihood estimation (which include results for the GLRT) hold if we replace the log-likelihood with the log quasi-likelihood. This, at least Simon seems to be arguing, suggests that the interpretation of the test I mention below, which is implemented in summary.gam() for random effects, is as reliable as if it were based on a proper likelihood.
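As a reminder of the boundary issue (standard theory for testing a single variance component, not specific to this model): the null hypothesis sigma^2 = 0 sits on the edge of the parameter space, so the usual chi-squared(1) reference for the (quasi-)likelihood ratio statistic is replaced by an equal mixture:

% Lambda is the (quasi-)likelihood ratio statistic for H0: sigma^2 = 0,
% with ell() the (quasi-)log-likelihood; the 50:50 mixture arises because
% sigma^2 >= 0 constrains the null value to the boundary
\Lambda = 2\left\{\ell(\hat{\beta}, \hat{\sigma}^2) - \ell(\hat{\beta}_0, 0)\right\},
\qquad
\Lambda \sim \tfrac{1}{2}\chi^2_0 + \tfrac{1}{2}\chi^2_1 \quad \text{under } H_0 .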

Unless I really needed to, I'd fit this model with gam() or bam(), and then gamm4() (from the gamm4 package), before resorting to gamm(), especially for non-Gaussian responses: gamm() has to fit the model as a mixed effects model using penalised quasi-likelihood (PQL), which is not necessarily the best way of estimating these models.

library(mgcv)
library(gamair)
devtools::install_github('gavinsimpson/gratia')
library(gratia)

data(sole)
sole$off <- log(sole$a.1 - sole$a.0)
sole$a <- (sole$a.1 + sole$a.0) / 2
solr <- sole
solr$t <- solr$t - mean(sole$t)
solr$t <- solr$t / var(sole$t)^0.5
solr$la <- solr$la - mean(sole$la)
solr$lo <- solr$lo - mean(sole$lo)

solr$station <- factor(with(solr, paste(-la, -lo, -t, sep = "")))

# the random intercept for station enters as a random-effect smooth, s(station, bs = 're')
som <- gam(eggs ~ te(lo, la, t, bs = c('tp', 'tp'), k = c(25, 5), d = c(2, 1)) +
           s(t, k = 5, by = a) + s(station, bs = 're') + offset(off),
           family = quasipoisson, data = solr, method = 'REML')

Then summary(som) gives a test for the random effect based on a likelihood ratio test, as suggested by @BenBolker, but with the reference distribution corrected for testing on the boundary of the parameter space.

> summary(som)

Family: quasipoisson 
Link function: log 

Formula:
eggs ~ te(lo, la, t, bs = c("tp", "tp"), k = c(25, 5), d = c(2, 
    1)) + s(t, k = 5, by = a) + s(station, bs = "re") + offset(off)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.4016     0.3061  -11.11   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                edf  Ref.df      F  p-value    
te(lo,la,t)  56.025  65.456  2.547 4.62e-10 ***
s(t):a        4.535   4.886 54.790  < 2e-16 ***
s(station)  128.563 388.000  1.175  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.833   Deviance explained =   88%
-REML = -7.9014  Scale est. = 0.58148   n = 1575

I was having trouble getting the model without the random effect to converge using gamm(), so I was unable to test the random effect term that way, and I even encountered an error when trying the multi-model form of anova(). The comparison I was attempting is sketched below.
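A sketch of that comparison using the gam() versions of the models (som0 is my name for the fit without the random effect; test = 'F' is the natural choice here because the quasi-Poisson scale parameter is estimated):

# hypothetical sketch: refit without the station random intercept and
# compare the two fits with the multi-model form of anova()
som0 <- gam(eggs ~ te(lo, la, t, bs = c('tp', 'tp'), k = c(25, 5), d = c(2, 1)) +
                s(t, k = 5, by = a) + offset(off),
            family = quasipoisson, data = solr, method = 'REML')
anova(som0, som, test = 'F')

Keep in mind that this naive comparison ignores the boundary problem discussed above, which is another reason to prefer the test reported by summary.gam().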

If you want to extract the random effects from the gam() model, you can use my gratia package (hopefully on CRAN in a few days, but it can be installed from GitHub as shown above), and then:

> evaluate_smooth(som, 's(station)')
# A tibble: 394 x 5
   smooth    by_variable station                                       est    se
   <chr>     <fct>       <chr>                                       <dbl> <dbl>
 1 s(statio… NA          -0.0004304761904734280.419685714285714-… -0.0396   2.55
 2 s(statio… NA          -0.0004304761904734280.6586857142857140…  1.48     1.20
 3 s(statio… NA          -0.0004304761904734281.15968571428571-1… -0.00606  2.63
 4 s(statio… NA          -0.0004304761904734281.176685714285710.… -0.0767   2.48
 5 s(statio… NA          -0.002430476190475870.9096857142857141.… -0.00654  2.63
 6 s(statio… NA          -0.01243047619047390.4106857142857140.0… -0.802    1.61
 7 s(statio… NA          -0.0154304761904740.631685714285714-0.4… -0.138    2.35
 8 s(statio… NA          -0.02043047619047660.375685714285714-0.… -0.426    1.94
 9 s(statio… NA          -0.02543047619047911.14668571428571-0.4… -0.0333   2.57
10 s(statio… NA          -0.02743047619047450.875685714285714-0.… -0.0673   2.49
# … with 384 more rows

The random intercept estimates are in the est column.
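If you'd rather not install an extra package, the same values can be pulled straight from the fitted model, since mgcv stores the random effect coefficients alongside the spline coefficients (a sketch; the coefficients are named 's(station).1', 's(station).2', and so on):

# extract the random intercepts directly from the fitted gam() model
ranef_station <- coef(som)[grepl("^s\\(station\\)", names(coef(som)))]
head(ranef_station)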

An offset is a term in the model whose coefficient is fixed at 1. In this instance it is being used to standardise the count response so that you are comparing like for like; here it integrates over the ages of the eggs found in each sample. See p. 143 of the 2nd edition of Simon's GAM book for more on what is being done for this model and what the offset means.

More generally, say you sample a river with two nets, one with twice the area of the other. You are likely to capture more things in the larger net, and hence the counts from the larger net will be higher on account of the greater sampling effort: you swept more of the river with the bigger net (assuming you sampled for the same amount of time). To account for this difference in effort, you can include an offset in the model; for a Poisson model with a log link this would be offset(log(net_area)). We have to include the offset on the link scale, hence the log(). What we are then modelling is the count per unit area of net, as the sketch below shows.
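In code, that might look like the following (a sketch with made-up names count, x, net_area, and survey_data):

# the offset enters on the log (link) scale, so the linear predictor
# describes the log of the count per unit net area
m <- gam(count ~ s(x) + offset(log(net_area)),
         family = poisson, data = survey_data)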
