简体   繁体   中英

Generating predictive simulations from a multilevel model with random intercepts

I am struggling to understand how, in R, to generate predictive simulations for new data using a multilevel linear regression model with a single set of random intercepts. Following the example on pp. 146-147 of this text , I can execute this task for a simple linear model with no random effects. What I can't wrap my head around is how to extend the set-up to accommodate random intercepts for a factor added to that model.

I'll use iris and some fake data to show where I'm getting stuck. I'll start with a simple linear model:

mod0 <- lm(Sepal.Length ~ Sepal.Width, data = iris)

Now let's use that model to generate 1,000 predictive simulations for 250 new cases. I'll start by making up those cases:

set.seed(20912)
fakeiris <- data.frame(Sepal.Length = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
                       Sepal.Width = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
                       Species = sample(as.character(unique(iris$Species)), 250, replace = TRUE),
                       stringsAsFactors=FALSE)

Following the example in the aforementioned text, here's what I do to get 1,000 predictive simulations for each of those 250 new cases:

library(arm)
n.sims = 1000  # set number of simulations
n.tilde = nrow(fakeiris)  # set number of cases to simulate
X.tilde <- cbind(rep(1, n.tilde), fakeiris[,"Sepal.Width"])  # create matrix of predictors describing those cases; need column of 1s to multiply by intercept
sim.fakeiris <- sim(mod0, n.sims)  # draw the simulated coefficients
y.tilde <- array(NA, c(n.sims, n.tilde))  # build an array to hold results
for (s in 1:n.sims) { y.tilde[s,] <- rnorm(n.tilde, X.tilde %*% sim.fakeiris@coef[s,], sim.fakeiris@sigma[s]) }  # use matrix multiplication to fill that array

That works fine, and now we can do things like colMeans(y.tilde) to inspect the central tendencies of those simulations, and cor(colMeans(y.tilde), fakeiris$Sepal.Length) to compare them to the (fake) observed values of Sepal.Length.

Now let's try an extension of that simple model in which we assume that the intercept varies across groups of observations --- here, species. I'll use lmer() from the lme4 package to estimate a simple multilevel/hierarchical model that matches that description:

library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)

Okay, that works, but now what? I run:

sim.fakeiris.lmer <- sim(mod1, n.sims)

When I use str() to inspect the result, I see that it is an object of class sim.merMod with three components:

  • @fixedef , a 1,000 x 2 matrix with simulated coefficients for the fixed effects (the intercept and Sepal.Width)

  • @ranef , a 1,000 x 3 matrix with simulated coefficients for the random effects (the three species)

  • @sigma , a vector of length 1,000 containing the sigmas associated with each of those simulations

I can't wrap my head around how to extend the matrix construction and multiplication used for the simple linear model to this situation, which adds another dimension. I looked in the text, but I could only find an example (pp. 272-275) for a single case in a single group (here, species). The real-world task I'm aiming to perform involves running simulations like these for 256 new cases (pro football games) evenly distributed across 32 groups (home teams). I'd greatly appreciate any assistance you can offer.

Addendum . Stupidly, I hadn't looked at the details on simulate.merMod() in lme4 before posting this. I have now. It seems like it should do the trick, but when I run simulate(mod0, nsim = 1000, newdata = fakeiris) , the result has only 150 rows. The values look sensible, but there are 250 rows (cases) in fakeiris . Where is that 150 coming from?

One possibility is to use the predictInterval function from the merTools package. The package is about to be submitted to CRAN, but the current developmental release is available for download from GitHub,

    install.packages("devtools")
    devtools::install_github("jknowles/merTools")

To get the median and a 95% credible interval of 100 simulations:

    mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)

    out <- predictInterval(mod1, newdata=fakeiris, level=0.95,
                           n.sims=100, stat="median")

By default, predictInterval includes the residual variation, but you can turn that feature off with:

    out2 <- predictInterval(mod1, newdata=fakeiris, level=0.95,
                           n.sims=100, stat="median", 
                           include.resid.var=FALSE)

Hope this helps!

This might help: it doesn't use sim() , but instead uses mvrnorm() to draw the new coefficients from the sampling distribution of the fixed-effect parameters, uses a bit of internal machinery ( setBeta0 ) to reassign the internal values of the fixed-effect coefficients. The internal values of the random effect coefficients are automatically resampled by simulate.merMod using the default argument re.form=NA . However, the residual variance is not resampled -- it is held fixed across the simulations, which isn't 100% realistic.

In your use case, you would specify newdata=fakeiris .

library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
simfun <- function(object,n=1,newdata=NULL,...) {
    v <- vcov(object)
    b <- fixef(object)
    betapars <- MASS::mvrnorm(n,mu=b,Sigma=v)
    npred <- if (is.null(newdata)) {
                 length(predict(object))
             } else nrow(newdata)
    res <- matrix(NA,npred,n)
    for (i in 1:n) {
        mod1@pp$setBeta0(betapars[i,])
        res[,i] <- simulate(mod1,newdata=newdata,...)[[1]]
    }
    return(res)
}
ss <- simfun(mod1,100)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM