How to analyze data with a binary response and two categorical variables in R

I have a data set with a binary response (0 and 1) and two categorical variables (one with four levels and the other with two levels).

library(data.table)

data<-data.table(Factor1=rep(c("A","B","C","D"),each=36),
                 Factor2=rep(c(rep("Red",18),rep("Blue",18)),4),
                 Response=rep(c(rep(1,11),rep(0,7),rep(0,18)),4))

I've been trying to analyze this with glm(), but I'm not sure it is the best way.

model<-glm(Response~Factor1+Factor2,family = binomial(),data=data)
summary(model)

Call:
glm(formula = Response ~ Factor1 + Factor2, family = binomial(), 
data = data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.37438  -0.00008  -0.00008   0.99245   0.99245  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.957e+01  1.267e+03  -0.015    0.988
Factor1B     8.942e-15  6.838e-01   0.000    1.000
Factor1C     7.681e-15  6.838e-01   0.000    1.000
Factor1D     7.345e-15  6.838e-01   0.000    1.000
Factor2Red   2.002e+01  1.267e+03   0.016    0.987

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 177.264  on 143  degrees of freedom
Residual deviance:  96.228  on 139  degrees of freedom
AIC: 106.23

Number of Fisher Scoring iterations: 18

According to this, none of the coefficients are significant. But when I look at the data, there is evidently a difference between "Red" and "Blue".

data[,sum(Response),by=c("Factor1","Factor2")]

   Factor1 Factor2 V1
1:       A     Red 11
2:       A    Blue  0
3:       B     Red 11
4:       B    Blue  0
5:       C     Red 11
6:       C    Blue  0
7:       D     Red 11
8:       D    Blue  0

I was expecting the coefficient Factor2Red to be significant, but it is not. I think this may be because of the high standard error for this coefficient.

If I check the odds ratios, I see that the value for this coefficient is very high. But I do not know whether that is enough to say that being Red or Blue has a significant effect.

exp(cbind(coef(model)))

                    [,1]
(Intercept) 3.181005e-09
Factor1B    1.000000e+00
Factor1C    1.000000e+00
Factor1D    1.000000e+00
Factor2Red  4.940037e+08

Would you recommend another way to analyze this?

Factor2 (Red vs. Blue) is significant. I believe the logistic model is unstable because every Response for Factor2 = Blue is 0, so its mean and standard deviation are both 0. This is a case of quasi-complete separation: the estimate for Factor2Red grows without bound and its standard error becomes very large, which makes the z test unreliable. You can run Fisher's exact test instead; see the documentation at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/fisher.test.html

Try this:

fisher.test(data$Factor2, data$Response, conf.level = 0.95)$conf.int
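
The call above returns only the confidence interval for the odds ratio. As a minimal sketch, the same test can be run on an explicit 2 x 2 table so that the counts and the p-value are visible too. The names dat, tab and ft are just illustrative, and the data are rebuilt with a plain data.frame so the snippet does not depend on data.table.

```r
# Rebuild the data from the question with base R (no packages needed).
dat <- data.frame(
  Factor1  = rep(c("A", "B", "C", "D"), each = 36),
  Factor2  = rep(c(rep("Red", 18), rep("Blue", 18)), 4),
  Response = rep(c(rep(1, 11), rep(0, 7), rep(0, 18)), 4)
)

# 2 x 2 contingency table of colour against response.
tab <- table(Factor2 = dat$Factor2, Response = dat$Response)
print(tab)

# Fisher's exact test on that table.
ft <- fisher.test(tab, conf.level = 0.95)
print(ft$p.value)   # p-value for the Red vs. Blue association
print(ft$conf.int)  # confidence interval for the odds ratio
```

The table should show 0 successes out of 72 for Blue and 44 out of 72 for Red, which is the empty cell that makes the glm() estimate blow up. Note that this test pools over Factor1, which seems reasonable here because all four levels have identical counts.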

Here is an informative plot:

library(ggplot2)
data$Factor1Factor2 <- interaction(data$Factor1, data$Factor2)
ggplot(data, aes(x = Factor1Factor2, y = Response, fill = Factor1)) + 
geom_boxplot()
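
If you would rather stay within the logistic model, one option that I am adding here as an assumption (it is not part of the output shown in the question) is a likelihood-ratio test for Factor2. It compares the deviance of the model with and without Factor2 instead of relying on the z test from summary(), which is the part that breaks down. A rough sketch, with dat, full and reduced as illustrative names:

```r
# Rebuild the data from the question with base R (no packages needed).
dat <- data.frame(
  Factor1  = rep(c("A", "B", "C", "D"), each = 36),
  Factor2  = rep(c(rep("Red", 18), rep("Blue", 18)), 4),
  Response = rep(c(rep(1, 11), rep(0, 7), rep(0, 18)), 4)
)

# Fit the model with and without Factor2. The full fit may warn that
# fitted probabilities of 0 or 1 occurred; that is the separation itself,
# so the warning is silenced here rather than treated as an error.
full <- suppressWarnings(
  glm(Response ~ Factor1 + Factor2, family = binomial(), data = dat)
)
reduced <- glm(Response ~ Factor1, family = binomial(), data = dat)

# Likelihood-ratio statistic: drop in deviance when Factor2 is added.
lr_stat <- deviance(reduced) - deviance(full)
lr_df   <- df.residual(reduced) - df.residual(full)
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(statistic = lr_stat, df = lr_df, p = p_value))
```

Using the deviances printed in the question (177.264 and 96.228), the statistic should come out at roughly 81 on 1 degree of freedom, which agrees with what the counts already suggest. Treat this as a complement to Fisher's exact test, not a replacement for it.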
