Dropping levels to compare a Control group against one disease: problems with regression/modelling/statistics since it's not a dummy variable?
I have a doubt about using droplevels in my dataset. My "Disease" column is a factor with five levels: a control group and four diseases.
BD$Etiología <- factor(BD$Etiología, levels=c(0,1,2,3,4) ,
labels= c("Control","Idiop","LMNA","BAG3","Isquémica"), ordered=FALSE)
Then I make a subset in order to compare just the Control cases against one of the diseases.
BD_C_ID <- subset(BD, Etiología=="Control" | Etiología=="Idiop")
BD_C_ID$Etiología= droplevels(BD_C_ID$Etiología)
BD_C_ID$Etiología
[1] Control Control Control Control Control Control Control Idiop Idiop Control Control Control
[13] Control Idiop Idiop Idiop Idiop Idiop Idiop Idiop Idiop Idiop Idiop Idiop
[25] Idiop Idiop Control Control Control Control Idiop Control Control Control Control Control
[37] Idiop Idiop Idiop Idiop
Levels: Control Idiop
The original factor was unordered, and I just dropped the levels I don't use. Can I treat the two remaining levels as a 0-1 coded value in order to use them in lm or in a logistic regression? Or will there be a problem?
Also, does that apply if I compare Control vs BAG3 (0 and 3 in the initial coding)? Or will I need to re-create the factor so that the codes are 0-1?
The short answer is that it doesn't matter. If you use the factor as a predictor in a linear model (lm) or a logistic regression, the model uses the first level as the reference level (under R's default treatment contrasts), so in this case it is always "Control". droplevels() is useful if you need to apply other functions to the factor, but if it is purely for lm() or glm(), these functions drop the unused levels themselves when they build the model frame.
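As a rough sketch of what "reference level" means here, the dummy column that lm() or glm() would build can be inspected with model.matrix(). The toy vector g below is hypothetical (it is not the asker's data) and only reuses the 0-4 coding from the question; it suggests that a Control vs BAG3 comparison ends up coded 0/1 even though BAG3 was 3 in the original coding.

```r
# Hypothetical toy vector using the same 0-4 coding as the question
g <- factor(c(0, 3, 3, 0, 3), levels = 0:4,
            labels = c("Control", "Idiop", "LMNA", "BAG3", "Isquemica"))
g2 <- droplevels(g)          # keeps only "Control" and "BAG3"

levels(g2)                   # "Control" comes first, so it is the reference
mm <- model.matrix(~ g2)     # the design matrix lm()/glm() would build
dummy <- unname(mm[, "g2BAG3"])
dummy                        # 0 for Control rows, 1 for BAG3 rows
```

The original numeric codes never reach the model; only the order of the factor levels decides which group is the baseline.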
To illustrate this with your example:
set.seed(111)
BD = data.frame(
Etiologia = sample(0:4,100,replace=TRUE),
x = rnorm(100),
y = rnorm(100)
)
We can just do:
BD$E <- factor(BD$Etiologia,levels=0:4,
labels= c("Control","Idiop","LMNA","BAG3","Isquemica"))
lm(y ~ x + E,data=subset(BD,E %in% c("Control","Idiop")))
Call:
lm(formula = y ~ x + E, data = subset(BD, E %in% c("Control", "Idiop")))
Coefficients:
(Intercept) x EIdiop
-0.05524 0.21596 0.30433
And using another comparison:
lm(y ~ x + E,data=subset(BD,E %in% c("Control","BAG3")))
Call:
lm(formula = y ~ x + E, data = subset(BD, E %in% c("Control",
"BAG3")))
Coefficients:
(Intercept) x EBAG3
-0.03355 0.08978 -0.21708
You get the same result as the first model if you do:
BD$Etiologia <- factor(BD$Etiologia, levels=c(0,1,2,3,4) ,
labels= c("Control","Idiop","LMNA","BAG3","Isquemica"), ordered=FALSE)
BD_C_ID <- droplevels(subset(BD, Etiologia=="Control" | Etiologia=="Idiop"))
lm(y ~ x + Etiologia,data=BD_C_ID)
Call:
lm(formula = y ~ x + Etiologia, data = BD_C_ID)
Coefficients:
(Intercept) x EtiologiaIdiop
-0.05524 0.21596 0.30433
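If the two-level factor is meant to be the outcome of the logistic regression rather than a predictor, the same idea seems to hold: with family = binomial, glm() treats the first level of a factor response as "failure" and the remaining level as "success". The sketch below uses simulated, hypothetical data (the names d, sub and is_case are assumptions made for illustration) and compares the factor response with an explicit 0/1 recoding.

```r
# Hypothetical simulated data, not the asker's dataset
set.seed(2)
d <- data.frame(
  g = factor(sample(c("Control", "Idiop", "BAG3"), 60, replace = TRUE),
             levels = c("Control", "Idiop", "BAG3")),
  x = rnorm(60)
)
sub <- droplevels(subset(d, g %in% c("Control", "BAG3")))

# Factor response: first level ("Control") is failure, "BAG3" is success
fit_factor <- glm(g ~ x, family = binomial, data = sub)

# Explicit 0/1 coding of the same outcome
sub$is_case <- as.integer(sub$g == "BAG3")
fit_01 <- glm(is_case ~ x, family = binomial, data = sub)

# Both fits should give the same coefficients
all.equal(coef(fit_factor), coef(fit_01))
```

So there should be no need to recode the factor by hand, as long as "Control" stays the first level.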