Coverage of posterior from mgcv

Question

I am fitting a spatial binomial model with mgcv package in R and want to simulate posterior distributions for prediction points (code below). I've been using simulated data to test the coverage properties of the posterior. What I've found is that when the overall prevalence is around 0.5 (50%) the coverage is pretty poor (~35% of true values lie within the 95% posterior interval) but this improves as you move away from 0.5. For example, when mean prevalence is 1%, ~97% lie within the 95% posterior. I guess my questions are:

Is this an inherent limitation of using GAMs/mgcv for this approach?
Is my Bayesian interpretation of the posterior incorrect?
Am I misspecifying something in my code?
Is there a better approach? (I've tried using the spaMM package to fit a model with a spatially correlated random effect (with Laplace approximation) which does a little better. No doubt MCMC methods would be even better but a geostatistical approach has limitations when scaling the amount of model/prediction points so I would love to use mgcv .

Any thoughts/comments would be extremely welcome!

Cheers, Hugh

library(mgcv)
library(RandomFields)
library(raster)

# Simluate some data
set.seed(1981)
mean <- 0
model <- RMexp(var=0.5, scale=50)
simu <- RandomFields::RFsimulate(model, x=1:256, 
                                 y=1:256, RFoptions(spConform=FALSE))

# Convert to raster
simu_raster <- raster(nrows = 256, ncol = 256, xmn=0, xmx=1, ymn=0, ymx=1)
simu_raster[] <- as.vector(simu)

# Add mean and onvert to probability
log_odds_raster <- mean + simu_raster 
prev_raster <- exp(log_odds_raster) / (1 + exp(log_odds_raster))

# simulate 1000 candidate sampling points
candidate_points <- coordinates(prev_raster)[sample(1:nrow(coordinates(prev_raster)), 1000),]

# Sample 100 of those and take binomial sample of 100 individuals per location 
sampled_points_idx <- sample(1:nrow(candidate_points), 100)
sampled_points <- as.data.frame(candidate_points[sampled_points_idx,])
sampled_points$n_pos <- rbinom(100, 100, extract(prev_raster, sampled_points))
sampled_points$n_neg <- 100 - sampled_points$n_pos

# Fit spatial GAM
spatial_mod <- gam(cbind(n_pos, n_neg) ~ s(x, y), 
                   data = sampled_points,
                   family="binomial")

# check k and plot observed v predicted
gam.check(spatial_mod)

# Simulate 1000 draws from the posterior at every non-sampled location
prediction_data <- as.data.frame(candidate_points[-sampled_points_idx,])
prediction_data$prev <- extract(prev_raster, prediction_data)
Cg <- predict(spatial_mod, prediction_data, type = "lpmatrix")
sims <- rmvn(1000, mu = coef(spatial_mod), V = vcov(spatial_mod, unconditional = TRUE))
fits <- Cg %*% t(sims)
fits_prev <- exp(fits) / (1 + exp(fits))

# For every prediction point, see whether the true/simulated prevalence
# lies within the posterior with correct accuracy. i.e. 95% of the time, 
# the true value should lie within the 95% BCI. 
BCI_95 <- apply(fits_prev, 1, FUN=function(x){quantile(x, prob = c(0.025, 0.975))})
within_BCI <- c()
for(i in 1:nrow(prediction_data)){
  within_BCI <- c(within_BCI, (prediction_data$prev[i] >= BCI_95[1,i] &
                                  prediction_data$prev[i] <= BCI_95[2,i]))
}
mean(within_BCI)

Answer 1

UPDATE:

Having run this experiment a number of times, the relationship between the mean and the coverage is less extreme than in the example above.

Also, above, I have used the default k (ie s(x,y) ) and using gam.check it suggests that k is sufficiently high. But if you use a higher k (ie s(x, y, k=100) ), allowing the spline to be more wiggly, then the prediction interval is naturally wider (ie more uncertain) and the coverage improves. Coverage is still pretty variable, but it is much better.

Would love to hear others' thoughts.

Coverage of posterior from mgcv

Question

1 answers

solution1
0 2020-02-27 00:42:54

Coverage of posterior from mgcv

Question

1 answers

solution1 0 2020-02-27 00:42:54

solution1
0 2020-02-27 00:42:54