
I have more observations than in the original dataset after imputation using mice. Original data (impute) observations = 27727 vs. imputed (impdat) = 138635

library(mice)

data <- read.csv("Documents/ABA/dataset.csv")
df <- subset(data, select=c(k7, n3, n2a, d1a1x, k17, bmgc23g, m1a_corruption_pos, 
                            j30_permit_pos, bmge1, lcu, j30_instability_pos, 
                            bmgc25))

# Keep only the variables that should be imputed
impute <- df[c("k7","k17","d1a1x","bmgc23g", "m1a_corruption_pos", 
               "j30_permit_pos", "bmge1", "lcu", "j30_instability_pos",
               "bmgc25")]

tempData <- mice(impute, m=5, maxit=10, method="pmm", seed=500)

impdat <- complete(tempData, action="long", include=FALSE)

May I know what is wrong or how it can be fixed?

This is correct. First, you used mice(., m=5) (the default) to impute your data set five times. Then, using complete(., action="long"), you stacked all five imputations in long format. To tell the individual imputations apart, two variables are added: .imp, which identifies the imputation, and .id, which holds the original row names.

library(mice)
imp <- mice(nhanes, m=5)

nhanes_imp <- complete(imp, action='long')
nhanes_imp
#     .imp .id age  bmi hyp chl
# 1      1   1   1 29.6   1 187
# 2      1   2   2 22.7   1 187
# 3      1   3   1 29.6   1 187
# [...]
# 26     2   1   1 22.7   1 118
# 27     2   2   2 22.7   1 187
# 28     2   3   1 30.1   1 187
# [...]
# 51     3   1   1 27.2   1 131
# 52     3   2   2 22.7   1 187
# 53     3   3   1 24.9   1 187
# [...]
# 76     4   1   1 22.0   1 113
# 77     4   2   2 22.7   1 187
# 78     4   3   1 22.0   1 187
# [...]
# 101    5   1   1 35.3   1 187
# 102    5   2   2 22.7   1 187
# 103    5   3   1 35.3   1 187
# [...]

Naturally, your imputed data set has five times as many rows as your initial one.

nrow(nhanes_imp) / nrow(nhanes)
# [1] 5
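The same factor accounts for the numbers in the question, since 27727 × 5 = 138635. As a minimal sketch, assuming the built-in nhanes data stands in for the asker's file, the row count can be checked against the number of imputations, and the long format can be split back into one data frame per imputation. The name per_imp and the seed value are illustrative choices, not part of the original answer.

```r
library(mice)

# Impute the built-in nhanes data five times (m = 5 is the default).
imp <- mice(nhanes, m = 5, seed = 500, printFlag = FALSE)

# Stack all completed data sets; .imp labels the imputation, .id the row.
nhanes_imp <- complete(imp, action = "long")

# The long format holds m copies of the original rows.
nrow(nhanes_imp) == imp$m * nrow(nhanes)

# Illustrative name: one completed data frame per imputation.
per_imp <- split(nhanes_imp, nhanes_imp$.imp)
sapply(per_imp, nrow)

# The same arithmetic applied to the row counts from the question.
27727 * 5
```

Each element of per_imp has the same number of rows as the original data, so nothing was duplicated by mistake; the rows are simply stacked.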

You should never use complete without action='long' (see my older answer on this).

Continue by pooling your results. For instance, for OLS you can use the pool() function that comes with mice; it essentially averages what lm does over the five imputed versions.

fit <- with(data=imp, expr=lm(bmi ~ hyp + chl))
summary(pool(fit))
#          term    estimate  std.error  statistic       df      p.value
# 1 (Intercept) 21.38468643 4.58030244  4.6688372 16.64367 0.0002323604
# 2         hyp -1.89607759 2.18239135 -0.8688073 19.00235 0.3957936019
# 3         chl  0.03942668 0.02449571  1.6095343 15.72940 0.1273825300
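To make the "averaging" concrete, here is a rough sketch, again assuming the nhanes example, that compares the pooled point estimates with the plain mean of the coefficients from the separate fits. The names coefs and pooled and the seed are illustrative. Note that only the point estimates are simple averages; the pooled standard errors also account for the variation between imputations, so they are not the mean of the individual standard errors.

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 500, printFlag = FALSE)
fit <- with(imp, lm(bmi ~ hyp + chl))

# One column of coefficients per imputed data set.
coefs <- sapply(fit$analyses, coef)

# Pooled results as reported by mice.
pooled <- summary(pool(fit))

# The pooled point estimates should match the averages across imputations.
all.equal(unname(rowMeans(coefs)), pooled$estimate)
```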

If we mistakenly run OLS on the stacked data without pooling the imputed data sets, the number of observations is inflated to five times its actual size. Hence the degrees of freedom are too large, and the variance and all statistics that depend on it are underestimated:

summary(lm(bmi ~ hyp + chl, nhanes_imp))
# Call:
# lm(formula = bmi ~ hyp + chl, data = nhanes_imp)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -6.9010 -2.7027  0.3682  3.0993  8.4682 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 21.165549   1.794706  11.793  < 2e-16 ***
# hyp         -1.920889   0.907041  -2.118   0.0362 *  
# chl          0.040573   0.009444   4.296 3.51e-05 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.934 on 122 degrees of freedom
# Multiple R-squared:  0.1346,  Adjusted R-squared:  0.1205 
# F-statistic: 9.492 on 2 and 122 DF,  p-value: 0.0001475
