Multiple imputation in R (mice) - How do I test imputation runs?

I work with a data set of 171 observations of 55 variables, 35 of which contain NAs that I want to impute with the mice function:

library(mice)

imp_Data <- mice(Data, m = 5, maxit = 50, method = "pmm", seed = 500)

imp_Data$imp

Now that I have the 5 imputation runs, I don't know how to test them and decide which of the 5 imputations is the best one to choose for my data set.

Looking into this topic, I repeatedly found scripts that use the with() function with a linear model and then the pool() function:

fit <- with(imp_Data, lm(a ~ b + c + d + e))

combine <- pool(fit)

But I don't understand what this linear model is needed for, or how it helps me find the best imputation run.

Can someone please explain in a simple way how I can test the 5 imputations and decide which one to choose?

Thanks for helping!

mice is a multiple imputation package. Multiple imputation itself is not really an imputation algorithm; it is rather a concept for how to impute data while also accounting for the uncertainty that comes with the imputation.

If you just want one imputed dataset, you can use single imputation packages like VIM (e.g. the functions irmi() or kNN()). The packages imputeR and missForest are also good for single imputation. They output one single imputed dataset.

If you still want to use mice and just want one imputed dataset at the end, you can either take any one of the five datasets or average across the five datasets.
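Both options could look roughly like the sketch below. It assumes the mice package is installed and uses its bundled nhanes example data in place of the asker's Data; the object names imp, first_completed, completed_list and averaged are illustrative only.

```r
library(mice)

# nhanes ships with mice; it stands in for the asker's own data here.
imp <- mice(nhanes, m = 5, method = "pmm", seed = 500, printFlag = FALSE)

# Option 1: take any single completed dataset, e.g. the first one.
first_completed <- complete(imp, 1)

# Option 2: average the five completed datasets cell by cell.
# This only makes sense when every column is numeric, as in nhanes.
completed_list <- lapply(1:5, function(i) complete(imp, i))
averaged <- Reduce(`+`, completed_list) / length(completed_list)
```

Observed values are identical in all five datasets, so averaging only changes the cells that were originally NA.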

There is a deeper reason why multiple imputation creates multiple imputed datasets. The idea is that the imputation itself introduces bias. You cannot really claim that an NA value you impute is, e.g., exactly 5. The more correct answer from a Bayesian point of view would be that the missing value is likely somewhere between 3 and 7. So if you just set it to 5, you introduce bias.

Multiple imputation addresses this problem by sampling from different probability distributions and in the end comes up with multiple imputed datasets, each of which is basically one possible solution.

The main idea of multiple imputation is now to take these five datasets, treat each as a possible solution, and perform your analysis on each one! Afterwards, your analysis results (and not the imputed datasets!) are pooled together.

So the with() and pool() parts have nothing to do with creating one dataset: with() runs the analysis on each of the five datasets, and pool() combines the five analysis results back together.
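To make the combining step concrete, here is a minimal base R sketch of the usual combining rules (Rubin's rules) for a single coefficient. The five estimates and standard errors are made-up numbers for illustration only, not results from any real data; pool() applies these rules for you and additionally handles degrees of freedom.

```r
# Hypothetical estimates of one regression coefficient, one per imputed dataset.
estimates  <- c(1.0, 1.2, 0.9, 1.1, 1.3)
# Hypothetical standard errors of those five estimates.
std_errors <- c(0.20, 0.22, 0.20, 0.25, 0.22)
m <- length(estimates)

# Pooled point estimate: the average of the m estimates.
pooled_estimate <- mean(estimates)

# Within-imputation variance: the average squared standard error.
within_var <- mean(std_errors^2)

# Between-imputation variance: how much the estimates differ across datasets.
between_var <- var(estimates)

# Total variance adds both sources, with a correction for finite m.
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)
```

The between-imputation term is what a single imputed dataset cannot provide: it carries the uncertainty caused by the missing values into the final standard error.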

The linear model is one form of analysis a lot of people apply to data (they want to analyze the relation of some variables to a response variable). In order to get unbiased results, this analysis is done 5 times and then the results are combined.
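A complete run of this analyze-then-combine workflow might look like the following sketch. It assumes the mice package is installed and again uses the bundled nhanes data with an arbitrary model (bmi ~ age + chl) chosen purely for illustration; substitute your own data and formula.

```r
library(mice)

# Create five imputed datasets from the example data shipped with mice.
imp <- mice(nhanes, m = 5, method = "pmm", seed = 500, printFlag = FALSE)

# Fit the same linear model on each of the five completed datasets.
fit <- with(imp, lm(bmi ~ age + chl))

# Combine the five sets of coefficients into one set of pooled results.
pooled <- pool(fit)
pooled_summary <- summary(pooled)
print(pooled_summary)
```

The summary has one row per model term, with a pooled estimate and a standard error that reflects the variation between the five imputations.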

So if you don't want to fit a linear model anyway, you don't need this part, because it has to do with the analysis of the data and not with the imputation.
