Error in missing value imputation using the mice package

I have a large data set (4M x 17) with missing values. Two columns are categorical; the rest are numerical. I want to use the mice package for missing value imputation. This is what I tried:

> testMice <- mice(myData[1:100000,]) # runs fine  
> testTot <- predict(testMice, myData)
Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "mids"

Running the imputation on the whole data set was computationally expensive, so I ran it on only the first 100K observations. I am now trying to use that output to impute the whole data set.

Is there anything wrong with my approach? If yes, what should I do to correct it? If no, why am I getting this error?

Neither mice nor Hmisc provides the parameter estimates from the imputation process. Both Amelia and imputeMulti do. In both cases, you can extract the parameter estimates and use them to impute your other observations.

  • Amelia assumes your data are distributed as multivariate normal, i.e. X \sim N(\mu, \Sigma).
  • imputeMulti assumes your data follow a multivariate multinomial distribution. That is, the complete cell counts are distributed as X \sim M(n, \theta), where n is the number of observations.
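
To make "use the parameter estimates to impute other observations" concrete for the multivariate normal case, here is a minimal base R sketch. It is not Amelia's API: mu, Sigma and impute_row are hypothetical names, and the numbers are made up to stand in for estimates taken from a fitted model. It fills each missing entry with its conditional mean given the observed entries, \mu_m + \Sigma_{mo} \Sigma_{oo}^{-1} (x_o - \mu_o), which is a deterministic single imputation rather than the random draws a multiple-imputation package would produce.

```r
# Hypothetical parameter estimates standing in for the output of a fitted
# multivariate-normal imputation model (values are made up for illustration).
mu <- c(x1 = 0, x2 = 1, x3 = 2)
Sigma <- matrix(c(1.0, 0.5, 0.2,
                  0.5, 2.0, 0.3,
                  0.2, 0.3, 1.5),
                nrow = 3, byrow = TRUE,
                dimnames = list(names(mu), names(mu)))

# Fill the missing entries of one row with their conditional mean given the
# observed entries: mu_m + Sigma_mo %*% solve(Sigma_oo) %*% (x_o - mu_o).
impute_row <- function(x, mu, Sigma) {
  miss <- is.na(x)
  if (!any(miss) || all(miss)) {
    # Nothing missing: x is returned unchanged.
    # Nothing observed: fall back to the marginal means.
    x[miss] <- mu[miss]
    return(x)
  }
  S_mo <- Sigma[miss, !miss, drop = FALSE]
  S_oo <- Sigma[!miss, !miss, drop = FALSE]
  x[miss] <- mu[miss] + as.vector(S_mo %*% solve(S_oo, x[!miss] - mu[!miss]))
  x
}

# A new observation that was not part of the fitting subset.
new_row <- c(x1 = 1, x2 = NA, x3 = 2)
imputed_row <- impute_row(new_row, mu, Sigma)
```

Applying impute_row over the rows of the remaining data would impute them from estimates fitted on the subset only, which is the workflow the question is after. This only makes sense if the subset is representative of the full data.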

Fitting can be done as follows, using example data. How to examine the parameter estimates is shown further below.

library(Amelia)
library(imputeMulti)

# example data shipped with imputeMulti; keep four categorical columns
data(tract2221, package = "imputeMulti")
cat_vars <- c("gender", "marital_status", "edu_attain", "emp_status")
test_dat2 <- tract2221[, cat_vars]

# fitting
IM_EM <- multinomial_impute(test_dat2, "EM", conj_prior = "non.informative", verbose = TRUE)
# noms= tells amelia which columns are nominal
amelia_EM <- amelia(test_dat2, m = 1, noms = cat_vars)

  • The parameter estimates from the amelia function are found in amelia_EM$mu and amelia_EM$theta.
  • The parameter estimates in imputeMulti are found in IM_EM@mle_x_y and can be accessed via the get_parameters method.

imputeMulti has noticeably higher imputation accuracy for categorical data than any of the other three packages, though it only accepts multinomial (e.g. factor) data.

All of this information is in the currently unpublished vignette for imputeMulti. The paper has been submitted to JSS, and I am awaiting a response before adding the vignette to the package.

You don't use predict() with mice. It's not a model you're fitting per se, which is why there is no predict method for the "mids" object that mice() returns. Your imputed results are already there for the 100,000 rows.
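
As a small sketch of where the imputed results live, the example below uses the nhanes data set bundled with mice instead of the asker's data; m = 2 and seed = 123 are arbitrary choices made for the illustration. mice() returns a "mids" object, and complete() extracts a filled-in data frame from it.

```r
library(mice)

# nhanes ships with mice: 25 rows, 4 columns, several missing values.
imp <- mice(nhanes, m = 2, printFlag = FALSE, seed = 123)

# The returned object holds the imputations; it is not a predictive model.
imp_class <- class(imp)

# Extract the first completed data set (observed values plus imputations).
completed <- mice::complete(imp, action = 1)
```

In the question's terms, calling complete() on testMice would return the 100,000 rows with their missing values filled in, and nothing for the rows that were never passed to mice().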

If you want imputed data for all rows, you have to pass all rows to mice. I wouldn't recommend it though, unless you set it up on a large cluster with dozens of CPU cores.


 