
Variable sample size per cluster/group in mixed effects logistic regression

I am attempting to run mixed effects logistic regression models, but am concerned about the variable sample sizes in each cluster/group, and also about the very low number of "successes" in some models.

I have ~700 trees distributed across 163 field plots (i.e., the clusters/groups), visited annually from 2004 to 2011. I am fitting a separate mixed effects logistic regression model (hereafter GLMM) for each year of the study, to compare this output with inference from a shared frailty model (i.e., a survival analysis with a random effect).

The number of trees per plot varies from 1 to 22, and some years have a very low number of "successes" (i.e., diseased trees). For example, in 2011 there were only 4 successes against 694 "failures" (i.e., healthy trees).

My questions are: (1) is there a general rule for the ideal number of samples per group when the inferential focus is only on estimating the fixed effects of the GLMM, and (2) are GLMMs stable when the ratio of successes to failures is this extreme?

Thank you for any advice or suggestions of sources.

-Sarah

(Hi, Sarah, sorry I didn't answer previously via e-mail ...)

It's hard to answer these questions in general -- you're stuck with your data, right? So it's not a question of power analysis. If you want to make sure that your results will be reasonably reliable, probably the best thing to do is to run some simulations. I'm going to show off a fairly recent feature of lme4 (in the development version 1.1-1, on GitHub): simulating data from a GLMM given a formula and a set of parameters.

First I have to simulate the predictor variables (you wouldn't have to do this, since you already have the data -- although you might want to try varying the range of number of plots, trees per plot, etc.).

set.seed(101)
## simulate number of trees per plot
## want mean of 700/163=4.3 trees, range=1-22
## by trial and error this is about right
r1 <- rnbinom(163,mu=3.3,size=2)+1
## generate plots and trees within plots
d <- data.frame(plot=factor(rep(1:163,r1)),
            tree=factor(unlist(lapply(r1,seq))))
## expand by year
library(plyr)
d2 <- ddply(d,c("plot","tree"),
        transform,year=factor(2004:2011))
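A quick check (a sketch that just reruns the two simulation lines above) confirms the trial-and-error negative-binomial parameters give roughly the right distribution of trees per plot:

```r
## regenerate the per-plot tree counts and inspect them
set.seed(101)
r1 <- rnbinom(163, mu = 3.3, size = 2) + 1
mean(r1)    ## should be near the target of 700/163 = 4.3
range(r1)   ## should sit within the observed 1-22
sum(r1)     ## total number of simulated trees
```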

Now set up the parameters. I'm going to assume year is a fixed effect and that overall disease incidence is plogis(-2) ≈ 0.12, except in 2011, when it is plogis(-2-3) ≈ 0.0067. The among-plot standard deviation is 1 (on the logit scale), as is the among-tree-within-plot standard deviation:

beta <- c(-2,0,0,0,0,0,0,-3)
theta <- c(1,1)  ## sd by plot and plot:tree
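These coefficients correspond to the stated incidence rates, which you can verify directly on the probability scale:

```r
## incidence probabilities implied by the fixed effects
plogis(-2)      ## baseline years: ~0.12
plogis(-2 - 3)  ## 2011: ~0.0067
```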

Now simulate: year as fixed effect, plot and tree-within-plot as random effects

library(lme4)
s1 <- simulate(~year+(1|plot/tree),family=binomial,
     newdata=d2,newparams=list(beta=beta,theta=theta))
d2$diseased <- s1[[1]]

Summarize/check:

d2sum <- ddply(d2,c("year","plot"),
           summarise,
           n=length(tree),
           nDis=sum(diseased),
           propDis=nDis/n)
library(ggplot2)
library(Hmisc)  ## for mean_cl_boot
theme_set(theme_bw())
ggplot(d2sum,aes(x=year,y=propDis))+geom_point(aes(size=n),alpha=0.3)+
    stat_summary(fun.data=mean_cl_boot,colour="red")

Now fit the model:

g1 <- glmer(diseased~year+(1|plot/tree),family=binomial,
        data=d2)
fixef(g1)

You can try this many times and see how often the results are reliable ...
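A self-contained sketch of that "try this many times" loop might look like the following (it regenerates the data from scratch; nsim is kept tiny for illustration, and you would use hundreds of replicates in practice):

```r
## repeatedly simulate and refit to see how well the 2011 effect
## (true value -3) is recovered
library(lme4)
library(plyr)
set.seed(101)
r1 <- rnbinom(163, mu = 3.3, size = 2) + 1
d  <- data.frame(plot = factor(rep(1:163, r1)),
                 tree = factor(unlist(lapply(r1, seq))))
d2 <- ddply(d, c("plot", "tree"), transform, year = factor(2004:2011))
beta  <- c(-2, 0, 0, 0, 0, 0, 0, -3)
theta <- c(1, 1)
nsim <- 5  ## tiny for speed; use many more in practice
est <- replicate(nsim, {
    d2$diseased <- simulate(~ year + (1 | plot/tree), family = binomial,
                            newdata = d2,
                            newparams = list(beta = beta, theta = theta))[[1]]
    fit <- suppressWarnings(
        glmer(diseased ~ year + (1 | plot/tree), family = binomial, data = d2))
    unname(fixef(fit)["year2011"])
})
summary(est)  ## spread/bias of the estimated 2011 effect
```

If the estimates are wildly variable or far from -3, that is direct evidence the design cannot support the 2011 contrast.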

As Josh said, this is a better question for CrossValidated.

There are no hard and fast rules for logistic regression, but one rule of thumb is that 10 successes and 10 failures are needed per cell of the design (here, per cluster), multiplied by the number of continuous variables in the model.
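As a hypothetical back-of-the-envelope illustration of that guideline (the parameter count here is assumed for illustration, not taken from the actual model):

```r
## events-per-variable check for the 2011 data: 4 diseased trees cannot
## support many parameters under a ~10-events-per-variable guideline
events   <- 4   ## "successes" in 2011 (from the question)
n_params <- 8   ## hypothetical number of fixed-effect parameters
epv <- events / n_params
epv  ## 0.5, far below the guideline of ~10
```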

In your case, I would expect the model, if it converges at all, to be unstable. You can examine that by bootstrapping the standard errors of the fixed-effect estimates.
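One way to run that bootstrap is a parametric bootstrap with lme4::bootMer; the following self-contained sketch simulates one data set as above, fits the GLMM, and bootstraps the fixed effects (nsim is deliberately tiny here; use hundreds in practice):

```r
## simulate one data set, fit the GLMM, then bootstrap the fixed effects
library(lme4)
library(plyr)
set.seed(101)
r1 <- rnbinom(163, mu = 3.3, size = 2) + 1
d  <- data.frame(plot = factor(rep(1:163, r1)),
                 tree = factor(unlist(lapply(r1, seq))))
d2 <- ddply(d, c("plot", "tree"), transform, year = factor(2004:2011))
d2$diseased <- simulate(~ year + (1 | plot/tree), family = binomial,
                        newdata = d2,
                        newparams = list(beta  = c(-2, 0, 0, 0, 0, 0, 0, -3),
                                         theta = c(1, 1)))[[1]]
g1 <- glmer(diseased ~ year + (1 | plot/tree), family = binomial, data = d2)
## parametric bootstrap of the fixed effects
bb <- bootMer(g1, FUN = fixef, nsim = 5)
apply(bb$t, 2, sd)  ## bootstrap SEs; watch for huge values on the 2011 term
```

Very large or wildly varying bootstrap standard errors on the 2011 coefficient would be the signature of the instability described above.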
