
`ddply` fails to apply logistic regression (GLM) by group to my dataset

I'm working out the LD50 (median lethal dose) for multiple populations from different experiments using the MASS package. It's simple enough when I subset the data and fit one group at a time, but I get an error when I use `ddply`. Essentially, I need an LD50 for each population at each temperature.

My data looks somewhat like this:

# dput(d)
d <- structure(list(Pop = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L), .Label = c("a", "b", "c"), class = "factor"), Temp = structure(c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("high", "low"), class = "factor"), 
Dose = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), Dead = c(0L, 
11L, 12L, 14L, 2L, 16L, 17L, 7L, 5L, 3L, 17L, 15L, 9L, 20L, 
8L, 19L, 7L, 2L, 20L, 14L, 9L, 15L, 1L, 15L), Alive = c(20L, 
9L, 8L, 6L, 18L, 4L, 3L, 13L, 15L, 17L, 3L, 5L, 11L, 0L, 
12L, 1L, 13L, 18L, 0L, 6L, 11L, 5L, 19L, 5L)), .Names = c("Pop", 
"Temp", "Dose", "Dead", "Alive"), class = "data.frame", row.names = c(NA, 
-24L))

The following works fine:

d$Mortality <- cbind(d$Alive, d$Dead)
a <- d[d$Pop=="a" & d$Temp=="high",]
library(MASS)
dose.p(glm(Mortality ~ Dose, family="binomial", data=a), p=0.5)[1]

But when I put this into ddply I get the following error:

library(plyr)
d$index <- paste(d$Pop, d$Temp, sep="_")
ddply(d, 'index', function(x) dose.p(glm(Mortality~Dose, family="binomial", data=x), p=0.5)[1])

Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1

I can get the right LD50 when I use a proportion but can't figure out where I've gone wrong with my approach (and had already written this question).

This may surprise you, but if you use the formula

cbind(Alive, Dead) ~ Dose

instead of

Mortality ~ Dose

the problem goes away.


library(MASS)
library(plyr)

## `d` is as your `dput` result

## a function to apply
f <- function(x) {
  fit <- glm(cbind(Alive, Dead) ~ Dose, family = "binomial", data = x)
  dose.p(fit, p=0.5)[[1]]
  }

## call `ddply`
ddply(d, .(Pop, Temp), f)

#  Pop Temp        V1
#1   a high 2.6946257
#2   a  low 2.1834099
#3   b high 2.5000000
#4   b  low 0.4830998
#5   c high 2.2899553
#6   c  low 2.5000000

So what happened with `Mortality ~ Dose`? Let's set `.inform = TRUE` when calling `ddply`:

## `d` is as your `dput` result
d$Mortality <- cbind(d$Alive, d$Dead)

## a function to apply
g <- function(x) {
  fit <- glm(Mortality ~ Dose, family = "binomial", data = x)
  dose.p(fit, p=0.5)[[1]]
  }

## call `ddply`
ddply(d, .(Pop, Temp), g, .inform = TRUE)

#Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
#Error: with piece 1: 
#  Pop Temp Dose Dead Alive Mortality
#1   a high    1    0    20        20
#2   a high    2   11     9         9
#3   a high    3   12     8         8
#4   a high    4   14     6         6

Now we see that the variable Mortality has lost its dimensions: only the first column (Alive) is retained in each piece. For a glm with a binomial response, a single-vector response must be 0/1 binary, a two-level factor, or a proportion between 0 and 1 (with the number of trials supplied as weights). Here the response holds the integers 20, 9, 8, 6, ..., so glm complains:

Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
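For reference, here is a minimal sketch of two response forms that glm does accept for grouped binomial data, using only the counts for population "a" at high temperature from the question. The names `a`, `Total`, `PropAlive`, `fit_counts` and `fit_prop` are made up for this illustration. Both fits should give the same LD50, which is presumably why the proportion approach mentioned in the question worked.

```r
library(MASS)  # provides dose.p()

# Counts for population "a" at high temperature, copied from the question's data
a <- data.frame(
  Dose  = 1:4,
  Dead  = c(0, 11, 12, 14),
  Alive = c(20, 9, 8, 6)
)

# Form 1: two-column matrix built inside the formula
fit_counts <- glm(cbind(Alive, Dead) ~ Dose, family = binomial, data = a)

# Form 2: proportion as the response, number of trials as prior weights
a$Total     <- a$Alive + a$Dead
a$PropAlive <- a$Alive / a$Total
fit_prop <- glm(PropAlive ~ Dose, family = binomial, weights = Total, data = a)

# Dose at which the fitted probability is 0.5, for each fit
ld50_counts <- dose.p(fit_counts, p = 0.5)[[1]]
ld50_prop   <- dose.p(fit_prop,   p = 0.5)[[1]]
```

A bare vector of counts such as 20, 9, 8, 6 matches neither form, which is what triggers the error above.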

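I have not found a way to make ddply keep the matrix column intact, so building the response inside the formula with cbind(Alive, Dead) is the workaround. I have tried using a protector: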

d$Mortality <- I(cbind(d$Alive, d$Dead))

but it still ends up with the same failure.
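If plyr is not essential, the same per-group computation can be sketched in base R with `split` and `sapply`. This is an alternative, not something the original code used. The data frame is rebuilt compactly from the question's `dput` output (every group has 20 individuals, so `Alive` is 20 minus `Dead`), and `ld50_one` and `pieces` are hypothetical names chosen for this example.

```r
library(MASS)  # provides dose.p()

# Same data as the question's dput(d), rebuilt compactly
d <- data.frame(
  Pop  = factor(rep(c("a", "b", "c"), each = 8)),
  Temp = factor(rep(rep(c("high", "low"), each = 4), times = 3)),
  Dose = rep(1:4, times = 6),
  Dead = c(0, 11, 12, 14, 2, 16, 17, 7, 5, 3, 17, 15,
           9, 20, 8, 19, 7, 2, 20, 14, 9, 15, 1, 15)
)
d$Alive <- 20 - d$Dead

# LD50 for one Pop-by-Temp piece; the response matrix is built in the formula
ld50_one <- function(x) {
  fit <- glm(cbind(Alive, Dead) ~ Dose, family = binomial, data = x)
  dose.p(fit, p = 0.5)[[1]]
}

# One data frame per Pop-by-Temp combination, named like "a.high"
pieces <- split(d, list(d$Pop, d$Temp))

# Named numeric vector with one LD50 per group
ld50 <- sapply(pieces, ld50_one)
```

The result is a named vector rather than a data frame, so the names (for example "a.high") would need to be split back into Pop and Temp if a table like ddply's is wanted.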
