How to analyze data with a binary response and two categorical variables in R
I have a set of data with a binary response (0 and 1) and two categorical variables (one with two levels and the other with four levels).
library(data.table)
data<-data.table(Factor1=rep(c("A","B","C","D"),each=36),
Factor2=rep(c(rep("Red",18),rep("Blue",18)),4),
Response=rep(c(rep(1,11),rep(0,7),rep(0,18)),4))
I've been trying to analyze this with glm(), but I'm not sure it is the best way.
model<-glm(Response~Factor1+Factor2,family = binomial(),data=data)
summary(model)
Call:
glm(formula = Response ~ Factor1 + Factor2, family = binomial(),
    data = data)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.37438  -0.00008  -0.00008   0.99245   0.99245

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.957e+01  1.267e+03  -0.015    0.988
Factor1B     8.942e-15  6.838e-01   0.000    1.000
Factor1C     7.681e-15  6.838e-01   0.000    1.000
Factor1D     7.345e-15  6.838e-01   0.000    1.000
Factor2Red   2.002e+01  1.267e+03   0.016    0.987

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 177.264  on 143  degrees of freedom
Residual deviance:  96.228  on 139  degrees of freedom
AIC: 106.23

Number of Fisher Scoring iterations: 18
According to this, none of the coefficients is significant. But when I look at the data, there is evidently a difference between "Red" and "Blue".
data[,sum(Response),by=c("Factor1","Factor2")]
   Factor1 Factor2 V1
1:       A     Red 11
2:       A    Blue  0
3:       B     Red 11
4:       B    Blue  0
5:       C     Red 11
6:       C    Blue  0
7:       D     Red 11
8:       D    Blue  0
I was expecting the coefficient Factor2Red to be significant, but it is not. I think this may be because of the high standard error of this coefficient.
If I check the odds ratios, I see that the value for this coefficient is very high. But I do not know whether that is enough to say that being red or blue has a significant effect.
exp(cbind(coef(model)))
                    [,1]
(Intercept) 3.181005e-09
Factor1B    1.000000e+00
Factor1C    1.000000e+00
Factor1D    1.000000e+00
Factor2Red  4.940037e+08
Would you recommend another way to analyze this?
The difference between Red and Blue in Factor2 is significant. I believe the logistic model is unstable because every response with Factor2 = Blue is 0, so both the mean and the standard deviation of Response in that group are 0. This is quasi-complete separation: the maximum-likelihood estimate for Factor2Red drifts toward infinity, its standard error blows up, and the Wald z-test reported by summary() becomes unreliable. You can run Fisher's exact test instead; see the documentation at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/fisher.test.html
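Before turning to Fisher's test, the separation itself can be checked with a cross-tabulation, and a likelihood-ratio test can stand in for the Wald test, since it compares deviances rather than relying on the inflated standard error. The following is only a sketch under assumptions: it rebuilds the question's data as a plain data.frame, and the names dat, tab, full, reduced and lrt are illustrative, not from the question.

```r
# Rebuild the example data (same values as in the question) as a data.frame
dat <- data.frame(
  Factor1  = rep(c("A", "B", "C", "D"), each = 36),
  Factor2  = rep(rep(c("Red", "Blue"), each = 18), times = 4),
  Response = rep(c(rep(1, 11), rep(0, 7), rep(0, 18)), times = 4)
)

# Cross-tabulate Factor2 against Response: the Blue row contains no 1s,
# which is the quasi-complete separation that destabilises the fit
tab <- table(dat$Factor2, dat$Response)
print(tab)

# Fit the model with and without Factor2.
# R may warn that fitted probabilities are numerically 0 or 1 for the full model.
full    <- glm(Response ~ Factor1 + Factor2, family = binomial(), data = dat)
reduced <- glm(Response ~ Factor1,           family = binomial(), data = dat)

# Likelihood-ratio (deviance) test for the contribution of Factor2
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)
```

The second row of lrt holds the drop in deviance due to Factor2 and its chi-squared p-value. In the summary() output from the question, the deviance falls from 177.264 (null) to 96.228 (residual) while the Factor1 coefficients are essentially zero, so nearly all of that drop comes from Factor2.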
Try this:
fisher.test(data$Factor2, data$Response, conf.level = 0.95)$conf.int
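The call above returns only the confidence interval for the odds ratio. A minimal sketch that also pulls out the p-value, assuming the same data rebuilt as a plain data.frame (the names dat and ft are illustrative). Note that this pools over Factor1, which seems reasonable here because the four Factor1 levels have identical counts.

```r
# Rebuild the example data (same values as in the question) as a data.frame
dat <- data.frame(
  Factor1  = rep(c("A", "B", "C", "D"), each = 36),
  Factor2  = rep(rep(c("Red", "Blue"), each = 18), times = 4),
  Response = rep(c(rep(1, 11), rep(0, 7), rep(0, 18)), times = 4)
)

# Fisher's exact test on the 2 x 2 table of Factor2 by Response
ft <- fisher.test(dat$Factor2, dat$Response, conf.level = 0.95)

print(ft$p.value)   # two-sided p-value for the Red vs. Blue association
print(ft$estimate)  # conditional maximum-likelihood estimate of the odds ratio
print(ft$conf.int)  # 95% confidence interval for the odds ratio
```

Because one cell of the 2 x 2 table is zero, expect the odds-ratio estimate and one end of the interval to be degenerate (0 or Inf); the p-value is the more useful quantity to report.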
Here is an informative plot:
library(ggplot2)
data$Factor1Factor2 <- interaction(data$Factor1, data$Factor2)
ggplot(data, aes(x = Factor1Factor2, y = Response, fill = Factor1)) +
geom_boxplot()
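Since Response only takes the values 0 and 1, the proportion of 1s in each Factor1 by Factor2 cell may be easier to read than a boxplot. A sketch in base R, assuming the same data rebuilt as a plain data.frame (dat and props are illustrative names):

```r
# Rebuild the example data (same values as in the question) as a data.frame
dat <- data.frame(
  Factor1  = rep(c("A", "B", "C", "D"), each = 36),
  Factor2  = rep(rep(c("Red", "Blue"), each = 18), times = 4),
  Response = rep(c(rep(1, 11), rep(0, 7), rep(0, 18)), times = 4)
)

# Proportion of Response == 1 in each cell (rows: Factor2, columns: Factor1)
props <- tapply(dat$Response, list(dat$Factor2, dat$Factor1), mean)
print(props)

# Grouped bar chart of those proportions, one pair of bars per Factor1 level
barplot(props, beside = TRUE, legend.text = TRUE, ylim = c(0, 1),
        xlab = "Factor1", ylab = "Proportion of Response = 1")
```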