
Checking if the sum of logical variables is greater than n, with NA, in R

I have a data frame with 5 binary variables (TRUE or FALSE, but represented as 1 or 0 for convenience), which can have missing values:

df <- data.frame(a = c(1,0,1,0,0,...),
                 b = c(1,0,NA,0,1,...),
                 c = c(1,0,1,0,NA,...),
                 d = c(0,1,1,NA,NA,...),
                 e = c(0,0,0,1,1,...))
     a  b  c  d  e
 1   1  1  1  0  0
 2   0  0  0  1  0
 3   1 NA  1  1  0
 4   0  0  0 NA  1
 5   0  1 NA NA  1
...

Now I want to create a variable that indicates whether each observation satisfies more than two of the five conditions, that is, whether the sum of a, b, c, d, and e is greater than 2.

For the first and second rows, the values are obviously TRUE and FALSE respectively. For the third row, the value should be TRUE, since the sum is greater than 2 regardless of whether b is TRUE or FALSE. For the fourth row, the value should be FALSE, since the sum is less than or equal to 2 regardless of whether d is TRUE or FALSE. For the fifth row, the value should be NA, since the sum can range from 2 to 4 depending on c and d. So the desired vector is c(TRUE, FALSE, TRUE, FALSE, NA, ...).

Here is my attempt:

library(dplyr)  # provides %>% and mutate()

df %>%
  mutate(a0 = ifelse(is.na(a), 0, a),
         b0 = ifelse(is.na(b), 0, b),
         c0 = ifelse(is.na(c), 0, c),
         d0 = ifelse(is.na(d), 0, d),
         e0 = ifelse(is.na(e), 0, e),
         a1 = ifelse(is.na(a), 1, a),
         b1 = ifelse(is.na(b), 1, b),
         c1 = ifelse(is.na(c), 1, c),
         d1 = ifelse(is.na(d), 1, d),
         e1 = ifelse(is.na(e), 1, e)
         ) %>%
  mutate(summin = a0 + b0 + c0 + d0 + e0,
         summax = a1 + b1 + c1 + d1 + e1) %>%
  mutate(f = ifelse(summax <= 2,
                    FALSE,
                    ifelse(summin >= 3, TRUE, NA)))

This works, but I had to create too many redundant variables, and the code would become too long if there were more variables. Is there a better solution?

I just noticed that you want NA when the missing values would determine whether the outcome is TRUE or FALSE, so I have changed the answer.

Nesting two conditional calls (base R's ifelse() around dplyr's if_else() in the code below) handles this: first test whether the row's sum of non-missing values is already greater than 2, and if not, check whether that sum plus the number of missing values is 2 or less.

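library(tidyverse)  # loads dplyr, which provides if_else()
n <- 2
# TRUE when the non-missing values alone already sum to more than n
want <- ifelse(rowSums(df, na.rm = TRUE) > n,
               TRUE,
               # FALSE when the sum cannot exceed n even if every NA were 1; otherwise NA
               if_else((rowSums(df, na.rm = TRUE) + rowSums(is.na(df))) <= n,
                       FALSE,
                       NA))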

If you want to stick to base R, you can replace if_else() with ifelse().
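The same bounds logic can also be sketched in base R without nesting conditionals. This is only an illustration, not part of the original answer: the helper name exceeds_n is hypothetical, and the data frame is limited to the five rows shown in the question (the elided rows are left out).

```r
# Hypothetical helper (an assumption, not from the original answer):
# TRUE if the row sum must exceed n, FALSE if it cannot, NA if it depends on the NAs.
exceeds_n <- function(data, n) {
  known   <- rowSums(data, na.rm = TRUE)  # lower bound: every NA counted as 0
  missing <- rowSums(is.na(data))         # number of NAs in each row
  out <- rep(NA, nrow(data))              # default: undetermined
  out[known > n] <- TRUE                  # already above n
  out[known + missing <= n] <- FALSE      # at most n even if every NA were 1
  out
}

# Only the five rows shown in the question
df <- data.frame(a = c(1, 0, 1, 0, 0),
                 b = c(1, 0, NA, 0, 1),
                 c = c(1, 0, 1, 0, NA),
                 d = c(0, 1, 1, NA, NA),
                 e = c(0, 0, 0, 1, 1))

f <- exceeds_n(df[, c("a", "b", "c", "d", "e")], n = 2)
f  # TRUE FALSE TRUE FALSE NA
```

Because the helper takes the columns as a data frame, adding more variables only changes the column selection, not the logic.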

I am not sure what you mean by "For the fifth row, the value should be NA, since the sum can range from 2 to 4 depending on c and d."

But the following produces the vector you want for the rows shown:

# df[, 1:5] selects the first 5 columns (df[1:5, ] would select the first 5 rows instead)
test <- ifelse(is.na(df$c), NA, ifelse(rowSums(df[, 1:5], na.rm = TRUE) > 2, TRUE, FALSE))

If there is an NA value in column c, an NA value is inserted in the new vector test. Otherwise, the code tests whether the sum of the first 5 columns is greater than 2: TRUE is inserted if it is, and FALSE if the sum is 2 or less.
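As a quick check, this approach can be run against the five sample rows from the question. The sketch below selects columns with df[, 1:5], and the extra one-row data frame is a hypothetical case that is not in the original data, added only to show what happens when the NA sits in a column other than c.

```r
# Only the five rows shown in the question
df <- data.frame(a = c(1, 0, 1, 0, 0),
                 b = c(1, 0, NA, 0, 1),
                 c = c(1, 0, 1, 0, NA),
                 d = c(0, 1, 1, NA, NA),
                 e = c(0, 0, 0, 1, 1))

# NA whenever column c is missing; otherwise compare the sum of known values with 2
test <- ifelse(is.na(df$c), NA,
               ifelse(rowSums(df[, 1:5], na.rm = TRUE) > 2, TRUE, FALSE))
test  # TRUE FALSE TRUE FALSE NA

# Hypothetical extra row (an assumption for illustration): the NA is in b, not c.
# The known values sum to 2, and the full sum could be 2 or 3.
extra <- data.frame(a = 1, b = NA, c = 0, d = 1, e = 0)
test_extra <- ifelse(is.na(extra$c), NA,
                     ifelse(rowSums(extra[, 1:5], na.rm = TRUE) > 2, TRUE, FALSE))
test_extra  # FALSE
```

So the check on column c reproduces the desired vector for the sample rows, but for a row like the hypothetical one it returns FALSE where the rule described in the question would give NA.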

