简体   繁体   English

如何处理 R 中 for 循环中的缺失数据 (NA)

[英]How to handle missing data (NA) in for loops in R

I am trying to calculate Chi Square discrepancies for the observed and simulated data and evaluate the model fit using Bayesian inference.我正在尝试计算观察数据和模拟数据的卡方差异,并使用贝叶斯推理评估模型拟合。 The observed dataset contains missing ("NA") values.观察到的数据集包含缺失(“NA”)值。 However, there are no missing values for the simulated one.但是,模拟的没有缺失值。 Thus, I am unable to compare the discrepancy stats between them.因此,我无法比较它们之间的差异统计数据。

The code presented below is an example, which is similar to my work:下面给出的代码是一个示例,与我的工作类似:

p <- array(runif(3000*195*6, 0, 1), c(3000, 195, 6))
N <- array(rpois(3000*195, 10), c(3000, 195))
y <- array(0, c(195, 6))
for(j in 1:195){
  for(k in 1:6){
    y[j,k] <- (rbinom(1, N[j], p[1,j,k]))
  }
}

foo <- runif(50, 1, 195)
bar <- runif(50, 1, 6)
for(i in 1:50){
  y[foo[i], bar[i]] <- NA
}

The code derives the response variable y including some missing values ("NA").该代码导出响应变量 y,其中包括一些缺失值(“NA”)。 Then, I calculated Chi Square for the data "y" and the simulated "ideal" dataset "y.new".然后,我计算了数据“y”和模拟的“理想”数据集“y.new”的卡方。 On the contrary, y.new does not have any missing values.相反,y.new 没有任何缺失值。 Thus, when I try to compare the sum of E and E.new, E.new should always be larger if I leave out the missing data in y but not y.new.因此,当我尝试比较 E 和 E.new 的总和时,如果我遗漏了 y 而不是 y.new 中的缺失数据,E.new 应该总是更大。

eval <- array(NA, c(3000, 195, 6))
E <- array(NA, c(3000, 195, 6))
E.new <- array(NA, c(3000, 195, 6))
y.new <- array(NA, c(195, 6))
for(i in 1:3000){
  for(j in 1:195){
    for(k in 1:6){
       eval[i,j,k] <- p[i,j,k]*N[i,j]
       E[i,j,k] <- ((y[j,k] - eval[i,j,k])^2) / (eval[i,j,k] + 0.5)
       y.new[i,j,k] <- rbinom(1, N[i,j], p[i,j,k])  # Create new "ideal" dataset
       E.new[i,j,k] <- ((y.new[i,j,k] - eval[i,j,k])^2) / (eval[i,j,k] + 0.5)
    }
  }
} # very slow! think about how to vectorize instead of nested for loops

fit <- sum(E)
fit.new <- sum(E.new)

Now, my question is how to handle the missing values?现在,我的问题是如何处理缺失值? Currently, the code above cannot subtract eval from y because of the missing values.目前,由于缺少值,上面的代码无法从 y 中减去 eval。 Even if it could, the fit and fit.new wouldn't be comparable.即使可以, fit 和 fit.new 也没有可比性。 My idea is to find the location of the missing values in y and drop those same [j,k] values from all the other arrays that I'm using.我的想法是找到 y 中缺失值的位置,并从我正在使用的所有其他数组中删除相同的 [j,k] 值。 Any suggestions on how to best do this?关于如何最好地做到这一点的任何建议?

EDIT: I'm getting a very strange result.编辑:我得到一个非常奇怪的结果。 Whether I run the code as above or as below (using sweep), E[1,,] is much smaller than E[>1,,].无论我是按上面还是下面的方式运行代码(使用扫描),E[1,,] 都比 E[>1,,] 小得多。 What is particularly strange is that eval[1,,] and eval[>1,,] appear to be the same.特别奇怪的是 eval[1,,] 和 eval[>1,,] 看起来是一样的。 I even tried replicating y[j,k] to make it y[i,j,k] where each y[i,,] were equal, just to see if it was the handling of different size matrices that was the problem.我什至尝试复制 y[j,k] 使其成为 y[i,j,k],其中每个 y[i,,] 都相等,只是想看看是否是处理不同大小矩阵的问题。 Does anyone know why this would be the case?有谁知道为什么会这样? In theory, with this simulated data, I think all the iterations of E[i,,] and E.new[i,,] should be somewhat similar.从理论上讲,有了这个模拟数据,我认为 E[i,,] 和 E.new[i,,] 的所有迭代应该有些相似。 Below is some summary info to show what I'm talking about.下面是一些摘要信息,以显示我在说什么。 This seems like a new question, but it relates to my original question, I just thought it must be the NA that were causing the problem but it seems like that might not be the only thing going on.这似乎是一个新问题,但它与我原来的问题有关,我只是认为一定是 NA 导致了这个问题,但似乎这可能不是唯一发生的事情。

> summary(eval[1,,])
       V1                 V2                 V3                V4          
 Min.   : 0.01167   Min.   : 0.01476   Min.   : 0.0293   Min.   : 0.01953  
 1st Qu.: 2.60909   1st Qu.: 2.35093   1st Qu.: 2.5239   1st Qu.: 1.85789  
 Median : 4.85460   Median : 5.12719   Median : 5.2480   Median : 4.35639  
 Mean   : 5.09371   Mean   : 5.39451   Mean   : 5.3891   Mean   : 4.72061  
 3rd Qu.: 6.91273   3rd Qu.: 7.44676   3rd Qu.: 7.5431   3rd Qu.: 7.06119  
 Max.   :15.81298   Max.   :14.94309   Max.   :14.9851   Max.   :16.25751  

> summary(eval1[2,,])
       V1                 V2                 V3                V4           
 Min.   : 0.06346   Min.   : 0.06468   Min.   : 0.2092   Min.   : 0.006769  
 1st Qu.: 2.44825   1st Qu.: 1.93702   1st Qu.: 2.4226   1st Qu.: 2.426689  
 Median : 4.16865   Median : 4.01536   Median : 5.0771   Median : 4.833679  
 Mean   : 4.85646   Mean   : 4.64887   Mean   : 5.3450   Mean   : 5.169656  
 3rd Qu.: 6.64691   3rd Qu.: 6.96278   3rd Qu.: 7.7034   3rd Qu.: 7.229125  
 Max.   :13.00335   Max.   :13.79093   Max.   :17.2673   Max.   :17.915080  

> summary(E[1,,])
       V1                V2                V3                 V4          
 Min.   :0.00001   Min.   :0.00000   Min.   :0.000003   Min.   :0.000008  
 1st Qu.:0.02744   1st Qu.:0.02723   1st Qu.:0.023008   1st Qu.:0.035854  
 Median :0.11750   Median :0.11889   Median :0.109138   Median :0.146706  
 Mean   :0.39880   Mean   :0.41636   Mean   :0.353876   Mean   :0.479533  
 3rd Qu.:0.46435   3rd Qu.:0.40993   3rd Qu.:0.390625   3rd Qu.:0.604021  
 Max.   :4.43466   Max.   :4.83871   Max.   :6.254577   Max.   :5.231650  
 NA's   :10        NA's   :8         NA's   :8          NA's   :10        

> summary(E[2,,])
       V1                 V2                  V3           
 Min.   :  0.0000   Min.   :  0.00003   Min.   :  0.00002  
 1st Qu.:  0.8213   1st Qu.:  0.42091   1st Qu.:  0.36853  
 Median :  2.0454   Median :  2.31697   Median :  2.39892  
 Mean   :  8.0619   Mean   :  9.40838   Mean   :  6.38919  
 3rd Qu.:  5.6755   3rd Qu.:  6.34782   3rd Qu.:  4.89749  
 Max.   :395.9499   Max.   :172.83324   Max.   :120.93648  
 NA's   :10         NA's   :8           NA's   :8          

Thanks, Dan谢谢,丹

You can add a test inside the inner loop and change the order of the loops as follows:您可以在内部循环中添加一个测试并更改循环的顺序,如下所示:

...
for(j in 1:195){
   for(k in 1:6){
      if ( !is.na(y(j,k)) ) {
         for(i in 1:3000){
             ...
         }
      }
   }
}
...

For more efficiency vectorize the inner loops (as described in the comments above).为了更有效地矢量化内部循环(如上面的评论所述)。

It is also possible to define a logical array with the same dimensions as y representing the subset of defined positions, eg, subset <-.is.na(y) and use it instead.也可以定义一个逻辑数组,其维度与y相同,表示已定义位置的子集,例如subset <-.is.na(y)并改为使用它。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM