Role of raw data in pooled estimates from mice (R package)?
I'm wondering what role the original data set plays when using the mice package in R for imputed data. I need to impute my data and then compute some additional variables before turning the long data set back into a mids object with as.mids. I noticed that when computing my additional variable ("total" in the code below), whether I used na.rm=TRUE affected my estimates, and from my understanding it shouldn't. Here's a reproducible example:
# Add required package
require(mice)
# Impute data and compute summary with na.rm=T
imp1 <- mice(nhanes, seed = 123)
com1 <- complete(imp1, "long", include = TRUE)
head(com1)
com1$total <- rowSums(com1[4:6],na.rm=T)
imp2 <- as.mids(com1)
# Fit model with data using na.rm=T
fit <- with(imp2, lm(bmi ~ age))
round(summary(pool(fit)), 2)
Notice that my variable "total" is the rowSums of 3 variables and that I've used na.rm=TRUE. However, since only the original data set (identified by the variable ".imp" in the long data set) contains NA values, this extra bit of code should only be relevant for the original data. Removing na.rm=TRUE shows that this is not true:
# Impute data and compute summary without na.rm=T
imp3 <- mice(nhanes, seed = 123)
com2 <- complete(imp3, "long", include = TRUE)
head(com2)
com2$total <- rowSums(com2[4:6])
imp4 <- as.mids(com2)
# Fit model with data without using na.rm=T
fit2 <- with(imp4, lm(bmi ~ age))
round(summary(pool(fit2)), 2)
Again, notice that leaving out na.rm=TRUE leads to different estimates. The only difference here is that the variable "total" now has NA values when the variable .imp is equal to zero (i.e., the original data set).
What am I missing? I would have thought that only the imputed data would affect the pooled estimates, yet I just showed that values in the original data set (i.e., those with .imp = 0) do as well. What is the role of the original data set in getting pooled estimates from mice?
NOTE: EDITED FOR CLARITY
I would imagine that the original (raw) data plays no role. According to the as.mids help page, it is only needed to signify where the missing data is. I ran your script and noticed there was an error when creating imp2. You call on the object com, which should be com1. After correcting this, I get the exact same results for the two approaches:
# Add required package
require(mice)
# Impute data and compute summary with na.rm=T
imp1 <- mice(nhanes, seed = 123)
com1 <- complete(imp1, "long", include = TRUE)
head(com1)
com1$total <- rowSums(com1[4:6],na.rm=T)
imp2 <- as.mids(com1)
# Fit model with data using na.rm=T
fit <- with(imp2, lm(bmi ~ age))
# Impute data and compute summary without na.rm=T
imp3 <- mice(nhanes, seed = 123)
com2 <- complete(imp3, "long", include = TRUE)
head(com2)
com2$total <- rowSums(com2[4:6])
imp4 <- as.mids(com2)
# Fit model with data without using na.rm=T
fit2 <- with(imp4, lm(bmi ~ age))
The results:
> round(summary(pool(fit)), 2)
            estimate std.error statistic    df p.value
(Intercept)    29.76      1.86     15.98 18.61    0.00
age            -1.73      0.95     -1.83 19.50    0.08
> round(summary(pool(fit2)), 2)
            estimate std.error statistic    df p.value
(Intercept)    29.76      1.86     15.98 18.61    0.00
age            -1.73      0.95     -1.83 19.50    0.08
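Rather than comparing the two printed tables by eye, the agreement can be checked programmatically. The following is a minimal sketch, not part of the original script: the object names (long_narm, long_na, pooled_narm, pooled_na) are made up for illustration, printFlag = FALSE is only there to silence the iteration log, and the columns are selected by name on the assumption that positions 4:6 of the long data correspond to bmi, hyp and chl.

```r
library(mice)

# Same imputation as above; printFlag = FALSE only silences the iteration log.
imp  <- mice(nhanes, seed = 123, printFlag = FALSE)
long <- complete(imp, "long", include = TRUE)

# Assumption: these are the three columns that com1[4:6] refers to.
vars <- c("bmi", "hyp", "chl")

# Version 1: "total" computed with na.rm = TRUE.
long_narm <- long
long_narm$total <- rowSums(long_narm[, vars], na.rm = TRUE)

# Version 2: "total" computed without na.rm, so it is NA in some .imp == 0 rows.
long_na <- long
long_na$total <- rowSums(long_na[, vars])

# Convert both back to mids objects and fit the same model on each.
mids_narm <- as.mids(long_narm)
mids_na   <- as.mids(long_na)
pooled_narm <- summary(pool(with(mids_narm, lm(bmi ~ age))))
pooled_na   <- summary(pool(with(mids_na, lm(bmi ~ age))))

# Both comparisons should report TRUE if the pooled results agree.
print(all.equal(pooled_narm$estimate, pooled_na$estimate))
print(all.equal(pooled_narm$std.error, pooled_na$std.error))
```

Comparing the estimate and std.error columns directly avoids calling round() on the whole summary, which may matter if the installed mice version returns non-numeric columns in that summary.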
In short, I think the different results may be due to an error in your code. I used mice 3.0.9.
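To see what as.mids actually keeps from the .imp = 0 rows, one can inspect the resulting object. This is a rough sketch under the assumption that the mids object exposes data and where components (as it does in mice 3.x); the object names are again only illustrative.

```r
library(mice)

# Impute, stack into long format (including the original data), convert back.
imp  <- mice(nhanes, seed = 123, printFlag = FALSE)
long <- complete(imp, "long", include = TRUE)
imp2 <- as.mids(long)

# The .imp == 0 rows are stored as the incomplete data of the new mids object.
# check.attributes = FALSE ignores differences in row names.
print(all.equal(imp2$data, nhanes, check.attributes = FALSE))

# The missingness pattern of those rows marks which cells carry imputations.
where_counts <- colSums(imp2$where)
na_counts    <- colSums(is.na(nhanes))
print(where_counts)
print(na_counts)
```

If the two sets of counts match, the original rows serve only as a template for where imputed values go, which is consistent with the model on bmi and age being unaffected by how "total" was computed.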