Drop levels to treat 2 of them as control vs case: regression/modelling/statistics problem since it's not a dummy?

Question

I've run into a doubt about using droplevels on my dataset. My disease column (Etiología) is a factor with 5 levels: a control group and 4 diseases.

BD$Etiología <- factor(BD$Etiología, levels=c(0,1,2,3,4) ,
labels= c("Control","Idiop","LMNA","BAG3","Isquémica"), ordered=FALSE)

Then I make a subset in order to compare the control cases against just one of the diseases.

BD_C_ID <- subset(BD, Etiología=="Control" | Etiología=="Idiop")

BD_C_ID$Etiología <- droplevels(BD_C_ID$Etiología)

BD_C_ID$Etiología

[1] Control Control Control Control Control Control Control Idiop   Idiop   Control Control Control
[13] Control Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop  
[25] Idiop   Idiop   Control Control Control Control Idiop   Control Control Control Control Control
[37] Idiop   Idiop   Idiop   Idiop  
Levels: Control Idiop

Since the original factor was unordered and I only dropped the levels I don't use, can I treat the result as a 0-1 coded variable in order to use it in an lm or a logistic regression? Or will there be a problem?

Also, does the same apply if I compare Control vs BAG3 (0 and 3 in the initial coding)? Or will I need to re-level them by re-applying factor() so that they are coded 0-1?

Answer 1

The short answer is that it doesn't matter. If you use the factor as a predictor in a linear model (lm) or a logistic regression, the model takes the first level as the reference level, so in this case it is always "Control". droplevels() is useful if you need to apply other functions to the factor, but if it is purely for lm() or glm(), these functions take care of the unused levels themselves.
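As a minimal sketch of what happens underneath (the toy factor `f` is hypothetical, not taken from your data): the integer codes stored inside a factor are 1, 2, ... rather than 0, 1, but the model never uses those codes directly. Assuming R's default treatment contrasts, it builds a dummy column that is 0 for the first level and 1 for the second, whatever the original numeric coding was.

```r
# Hypothetical toy factor with the same five levels as in the question;
# only "Control" (0) and "BAG3" (3) actually occur
f <- factor(c(0, 3, 0, 3, 3), levels = 0:4,
            labels = c("Control", "Idiop", "LMNA", "BAG3", "Isquemica"))

# Keep only the levels that are present
f2 <- droplevels(f)
levels(f2)

# Internal integer codes start at 1, so they are not a 0/1 coding ...
as.integer(f2)

# ... but the default treatment contrasts give a 0/1 dummy,
# with the first level ("Control") as the reference
contrasts(f2)
dummy <- model.matrix(~ f2)[, "f2BAG3"]
dummy
```

So there is no need to recode BAG3 from 3 to 1 by hand; the dummy column is derived from the level order, not from the original numbers.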

To illustrate this using your example:

set.seed(111)
BD <- data.frame(
  Etiologia = sample(0:4, 100, replace = TRUE),
  x = rnorm(100),
  y = rnorm(100)
)

We can just do:

BD$E <- factor(BD$Etiologia,levels=0:4,
labels= c("Control","Idiop","LMNA","BAG3","Isquemica"))

lm(y ~ x + E,data=subset(BD,E %in% c("Control","Idiop")))

Call:
lm(formula = y ~ x + E, data = subset(BD, E %in% c("Control", "Idiop")))

Coefficients:
(Intercept)            x       EIdiop  
   -0.05524      0.21596      0.30433

And using another comparison:

lm(y ~ x + E,data=subset(BD,E %in% c("Control","BAG3")))

Call:
lm(formula = y ~ x + E, data = subset(BD, E %in% c("Control", 
    "BAG3")))

Coefficients:
(Intercept)            x        EBAG3  
   -0.03355      0.08978     -0.21708
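If you want to check this yourself, here is a sketch that rebuilds the simulated `BD` from above and fits the Control vs BAG3 model with and without droplevels(). The object names `BD_C_BAG3`, `fit_subset` and `fit_dropped` are only for illustration.

```r
# Rebuild the simulated data so the block runs on its own
set.seed(111)
BD <- data.frame(Etiologia = sample(0:4, 100, replace = TRUE),
                 x = rnorm(100), y = rnorm(100))
BD$E <- factor(BD$Etiologia, levels = 0:4,
               labels = c("Control", "Idiop", "LMNA", "BAG3", "Isquemica"))

BD_C_BAG3 <- subset(BD, E %in% c("Control", "BAG3"))

# Unused levels are still attached to E here
fit_subset <- lm(y ~ x + E, data = BD_C_BAG3)

# Unused levels removed explicitly before fitting
fit_dropped <- lm(y ~ x + E, data = droplevels(BD_C_BAG3))

# Levels that lm() actually used for E in the first fit
fit_subset$xlevels$E

# Compare the coefficients of the two fits
all.equal(coef(fit_subset), coef(fit_dropped))
```

Both fits should report only an EBAG3 coefficient next to the intercept and x, because lm() discards the levels that do not occur in the data it is given.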

You get the same result if you do:

BD$Etiologia <- factor(BD$Etiologia, levels=c(0,1,2,3,4) ,
labels= c("Control","Idiop","LMNA","BAG3","Isquemica"), ordered=FALSE)

BD_C_ID <- droplevels(subset(BD, Etiologia=="Control" | Etiologia=="Idiop"))

lm(y ~ x + Etiologia,data=BD_C_ID)

Call:
lm(formula = y ~ x + Etiologia, data = BD_C_ID)

Coefficients:
   (Intercept)               x  EtiologiaIdiop  
      -0.05524         0.21596         0.30433
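For the logistic regression part of the question, here is a hedged sketch with the disease factor as the outcome. The model `E ~ x` on simulated data is purely illustrative, not a suggestion for your analysis, and `is_idiop` is a hypothetical helper column. With a factor response and family = binomial, glm() treats the first level as 0 and any other level as 1, so Control vs Idiop behaves like a hand-made 0/1 variable.

```r
# Rebuild the simulated data so the block runs on its own
set.seed(111)
BD <- data.frame(Etiologia = sample(0:4, 100, replace = TRUE),
                 x = rnorm(100), y = rnorm(100))
BD$E <- factor(BD$Etiologia, levels = 0:4,
               labels = c("Control", "Idiop", "LMNA", "BAG3", "Isquemica"))

BD_C_ID <- droplevels(subset(BD, E %in% c("Control", "Idiop")))

# Factor response: "Control" (first level) counts as 0, "Idiop" as 1
fit_factor <- glm(E ~ x, family = binomial, data = BD_C_ID)

# Hand-made 0/1 response for comparison (hypothetical helper column)
BD_C_ID$is_idiop <- as.integer(BD_C_ID$E == "Idiop")
fit_numeric <- glm(is_idiop ~ x, family = binomial, data = BD_C_ID)

# Compare the coefficients of the two fits
all.equal(coef(fit_factor), coef(fit_numeric))
```

The same holds for Control vs BAG3: what matters is that "Control" is the first level, not that BAG3 was originally coded as 3.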
