How to handle missing data (NA) in for loops in R
I am trying to calculate chi-square discrepancies for observed and simulated data in order to evaluate model fit using Bayesian inference. The observed dataset contains missing (NA) values, but the simulated one does not, so I am unable to compare the discrepancy statistics between them.
The code below is an example similar to my work:
p <- array(runif(3000*195*6, 0, 1), c(3000, 195, 6))
N <- array(rpois(3000*195, 10), c(3000, 195))
y <- array(0, c(195, 6))
for(j in 1:195){
  for(k in 1:6){
    # N[1, j] (not N[j]) so that N matches the p[1, j, k] draw
    y[j,k] <- (rbinom(1, N[1,j], p[1,j,k]))
  }
}
# Non-integer indices are truncated by R when used for subsetting
foo <- runif(50, 1, 195)
bar <- runif(50, 1, 6)
for(i in 1:50){
  y[foo[i], bar[i]] <- NA
}
The code generates the response variable y, including some missing (NA) values. I then calculate the chi-square discrepancy for the data y and for the simulated "ideal" dataset y.new. Unlike y, y.new does not have any missing values. Thus, when I compare the sums of E and E.new, E.new should always be larger if I leave out the missing data in y but not in y.new.
eval <- array(NA, c(3000, 195, 6))
E <- array(NA, c(3000, 195, 6))
E.new <- array(NA, c(3000, 195, 6))
y.new <- array(NA, c(3000, 195, 6)) # three dimensions, since it is indexed as y.new[i,j,k]
for(i in 1:3000){
  for(j in 1:195){
    for(k in 1:6){
      eval[i,j,k] <- p[i,j,k]*N[i,j]
      E[i,j,k] <- ((y[j,k] - eval[i,j,k])^2) / (eval[i,j,k] + 0.5)
      y.new[i,j,k] <- rbinom(1, N[i,j], p[i,j,k]) # Create new "ideal" dataset
      E.new[i,j,k] <- ((y.new[i,j,k] - eval[i,j,k])^2) / (eval[i,j,k] + 0.5)
    }
  }
} # very slow! think about how to vectorize instead of nested for loops
fit <- sum(E)
fit.new <- sum(E.new)
Now, my question is how to handle the missing values. Currently, the code above cannot subtract eval from y because of the missing values. Even if it could, fit and fit.new wouldn't be comparable. My idea is to find the locations of the missing values in y and drop those same [j,k] cells from all the other arrays that I'm using. Any suggestions on how best to do this?
EDIT: I'm getting a very strange result. Whether I run the code as above or as below (using sweep), E[1,,] is much smaller than E[>1,,]. What is particularly strange is that eval[1,,] and eval[>1,,] appear to be the same. I even tried replicating y[j,k] to make it y[i,j,k], with every y[i,,] equal, just to see whether the handling of different-sized matrices was the problem. Does anyone know why this would be the case? In theory, with this simulated data, I think all the iterations of E[i,,] and E.new[i,,] should be somewhat similar. Below is some summary info to show what I'm talking about. This seems like a new question, but it relates to my original one: I thought the NAs must be causing the problem, but it seems that might not be the only thing going on.
> summary(eval[1,,])
V1 V2 V3 V4
Min. : 0.01167 Min. : 0.01476 Min. : 0.0293 Min. : 0.01953
1st Qu.: 2.60909 1st Qu.: 2.35093 1st Qu.: 2.5239 1st Qu.: 1.85789
Median : 4.85460 Median : 5.12719 Median : 5.2480 Median : 4.35639
Mean : 5.09371 Mean : 5.39451 Mean : 5.3891 Mean : 4.72061
3rd Qu.: 6.91273 3rd Qu.: 7.44676 3rd Qu.: 7.5431 3rd Qu.: 7.06119
Max. :15.81298 Max. :14.94309 Max. :14.9851 Max. :16.25751
> summary(eval1[2,,])
V1 V2 V3 V4
Min. : 0.06346 Min. : 0.06468 Min. : 0.2092 Min. : 0.006769
1st Qu.: 2.44825 1st Qu.: 1.93702 1st Qu.: 2.4226 1st Qu.: 2.426689
Median : 4.16865 Median : 4.01536 Median : 5.0771 Median : 4.833679
Mean : 4.85646 Mean : 4.64887 Mean : 5.3450 Mean : 5.169656
3rd Qu.: 6.64691 3rd Qu.: 6.96278 3rd Qu.: 7.7034 3rd Qu.: 7.229125
Max. :13.00335 Max. :13.79093 Max. :17.2673 Max. :17.915080
> summary(E[1,,])
V1 V2 V3 V4
Min. :0.00001 Min. :0.00000 Min. :0.000003 Min. :0.000008
1st Qu.:0.02744 1st Qu.:0.02723 1st Qu.:0.023008 1st Qu.:0.035854
Median :0.11750 Median :0.11889 Median :0.109138 Median :0.146706
Mean :0.39880 Mean :0.41636 Mean :0.353876 Mean :0.479533
3rd Qu.:0.46435 3rd Qu.:0.40993 3rd Qu.:0.390625 3rd Qu.:0.604021
Max. :4.43466 Max. :4.83871 Max. :6.254577 Max. :5.231650
NA's :10 NA's :8 NA's :8 NA's :10
> summary(E[2,,])
V1 V2 V3
Min. : 0.0000 Min. : 0.00003 Min. : 0.00002
1st Qu.: 0.8213 1st Qu.: 0.42091 1st Qu.: 0.36853
Median : 2.0454 Median : 2.31697 Median : 2.39892
Mean : 8.0619 Mean : 9.40838 Mean : 6.38919
3rd Qu.: 5.6755 3rd Qu.: 6.34782 3rd Qu.: 4.89749
Max. :395.9499 Max. :172.83324 Max. :120.93648
NA's :10 NA's :8 NA's :8
Thanks, Dan
You can add a test inside the inner loop and change the order of the loops as follows:
...
for(j in 1:195){
  for(k in 1:6){
    if ( !is.na(y[j,k]) ) { # skip cells where the observation is missing
      for(i in 1:3000){
        ...
      }
    }
  }
}
...
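As a rough illustration, the reordered loop might be filled in as follows. This is only a sketch under assumptions: the dimensions are shrunk for speed, the names n_iter, n_site, n_rep and expected are made up for the example, and sum(..., na.rm = TRUE) is used for the totals because the skipped cells stay NA in both E and E.new.

```r
set.seed(1)
n_iter <- 20; n_site <- 15; n_rep <- 4   # hypothetical small sizes, illustration only

p <- array(runif(n_iter * n_site * n_rep), c(n_iter, n_site, n_rep))
N <- array(rpois(n_iter * n_site, 10), c(n_iter, n_site))
y <- array(rbinom(n_site * n_rep, N[1, ], p[1, , ]), c(n_site, n_rep))
y[sample(length(y), 5)] <- NA            # knock out 5 distinct cells

E     <- array(NA_real_, c(n_iter, n_site, n_rep))
E.new <- array(NA_real_, c(n_iter, n_site, n_rep))

for (j in seq_len(n_site)) {
  for (k in seq_len(n_rep)) {
    if (!is.na(y[j, k])) {               # missing cells are skipped for E and E.new alike
      for (i in seq_len(n_iter)) {
        expected <- p[i, j, k] * N[i, j]
        y.new    <- rbinom(1, N[i, j], p[i, j, k])
        E[i, j, k]     <- (y[j, k] - expected)^2 / (expected + 0.5)
        E.new[i, j, k] <- (y.new   - expected)^2 / (expected + 0.5)
      }
    }
  }
}

# Skipped cells remain NA in both arrays, so both sums cover the same cells
fit     <- sum(E, na.rm = TRUE)
fit.new <- sum(E.new, na.rm = TRUE)
```

Because the test sits outside the loop over i, each missing cell is checked once rather than once per iteration.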
For more efficiency, vectorize the inner loops (as described in the comments above). It is also possible to define a logical array with the same dimensions as y representing the subset of defined positions, e.g., subset <- !is.na(y), and use it instead.
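One way the logical-array idea could be combined with vectorization is sketched below. It is not the only approach; the small dimensions and the names observed, y_rep and mask_rep are assumptions made for the example. The mask is called observed rather than subset so that it does not shadow base R's subset() function.

```r
set.seed(2)
n_iter <- 20; n_site <- 15; n_rep <- 4   # hypothetical small sizes, illustration only

p <- array(runif(n_iter * n_site * n_rep), c(n_iter, n_site, n_rep))
N <- array(rpois(n_iter * n_site, 10), c(n_iter, n_site))
y <- array(rbinom(n_site * n_rep, N[1, ], p[1, , ]), c(n_site, n_rep))
y[sample(length(y), 5)] <- NA

observed <- !is.na(y)                    # TRUE where y[j, k] is defined

# N[i, j] is recycled over k because p is stored with i varying fastest, then j
expected <- p * as.vector(N)
y.new    <- array(rbinom(length(p), as.vector(N), as.vector(p)), dim(p))

# Replicate y and the mask along the iteration dimension: [i, j, k] <- [j, k]
y_rep    <- aperm(array(y,        c(n_site, n_rep, n_iter)), c(3, 1, 2))
mask_rep <- aperm(array(observed, c(n_site, n_rep, n_iter)), c(3, 1, 2))

E     <- (y_rep - expected)^2 / (expected + 0.5)
E.new <- (y.new - expected)^2 / (expected + 0.5)

# Sum over the same observed cells in both arrays so the totals are comparable
fit     <- sum(E[mask_rep])
fit.new <- sum(E.new[mask_rep])
```

Indexing both E and E.new with the same mask drops the [j,k] cells that are missing in y from every iteration, which is what makes the two sums comparable.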