Count NA in multiple columns in R

Question

I'm trying to count the number of NAs in multiple columns of my data. Here is a reproducible sample.

dt <- structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0, 
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0, 
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
10L), class = "data.frame")

I tried two methods. The first one:

dt %>% select(starts_with("V2QE38")) %>% colSums(is.na(.))

This gives me some results (in short, it reports NAs in some columns). Then I tried another one:

colSums(is.na(dt[,c("V2QE38A", "V2QE38B", "V2QE38C", "V2QE38D")]))

And I found no NA in any of these columns.

I think the second result is correct, but what did I do wrong to get the first result? Thank you!

Answer 1

In the first case, the right-hand side of the pipe contains nested function calls. We can either wrap the expression in {}

library(dplyr)
dt %>% 
    select(starts_with("V2QE38")) %>%
    {colSums(is.na(.))}
V2QE38A V2QE38B V2QE38C V2QE38D 
      0       0       0       0

or add another %>%

dt %>%
    select(starts_with("V2QE38")) %>%
    is.na %>%
    colSums

-output

V2QE38A V2QE38B V2QE38C V2QE38D 
      0       0       0       0

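The issue is that, when the dot appears only inside a nested call, %>% still inserts the left-hand side as the first argument. The call therefore becomes colSums(., is.na(.)): colSums runs on the data itself, and is.na(.) ends up in its second argument, na.rm, instead of being the input. The result is the plain column sums: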

dt %>% 
    select(starts_with("V2QE38")) %>% 
    colSums(.)
V2QE38A V2QE38B V2QE38C V2QE38D 
      6       1      12       0

which is the same as the OP's output from colSums(is.na(.)).
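The first-argument behaviour of %>% can be checked in isolation with a stand-in function. This is only a minimal sketch, assuming magrittr is installed; count_extra_args is a hypothetical helper written for this illustration, not part of any package.

```r
library(magrittr)

# Hypothetical helper: reports how many arguments arrive after the first one.
count_extra_args <- function(x, ...) length(list(...))

# The dot appears only inside a nested call, so %>% still inserts the
# left-hand side as the first argument: count_extra_args(., is.na(.))
n_plain <- 1:3 %>% count_extra_args(is.na(.))

# Braces switch off the first-argument insertion: count_extra_args(is.na(.))
n_braced <- 1:3 %>% { count_extra_args(is.na(.)) }

n_plain   # 1
n_braced  # 0
```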

Answer 2

Base R solution using sapply and an anonymous function, function(x){sum(is.na(x))}:

data = structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0, 
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0, 
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
10L), class = "data.frame")

sapply(data, function(x){sum(is.na(x))})
# V2QE38A V2QE38B V2QE38C V2QE38D 
#       0       0       0       0

Explanation:

sapply applies a function to each element of a list. A data.frame is a list in which each column vector is one element. The s in sapply stands for simplify, so sapply will try to convert the output list (from lapply) to a vector. If the required output is a list (it has some advantages), use lapply instead.
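To make the difference between the two concrete, here is a small sketch on made-up data that does contain NAs. The data frame df and the helper count_na are illustrative names, not from the question.

```r
# Illustrative data with known NA positions (not the OP's data).
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

# Illustrative helper: number of NAs in one vector.
count_na <- function(x) sum(is.na(x))

na_vec  <- sapply(df, count_na)  # simplified to a named integer vector
na_list <- lapply(df, count_na)  # stays a named list, one element per column

na_vec
# a b 
# 1 2 
is.list(na_list)
# [1] TRUE
```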

is.na returns a logical (TRUE/FALSE) vector. This can be converted to a numeric vector with 1/0 values.

sum converts the TRUE/FALSE vector into a 1/0 vector and sums the values.
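These two steps can be seen on a single vector. A minimal sketch with a made-up vector x:

```r
# Made-up vector with three missing values.
x <- c(4, NA, 7, NA, NA)

na_flags  <- is.na(x)              # FALSE TRUE FALSE TRUE TRUE
na_ints   <- as.integer(na_flags)  # 0 1 0 1 1
n_missing <- sum(na_flags)         # TRUE counts as 1, FALSE as 0

n_missing
# [1] 3
```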

Alternative solutions:

Alternatively, instead of treating the data.frame as a list, treat it as a matrix. Then the highly optimized rowSums and colSums can come into play.

colSums(is.na(data))
# V2QE38A V2QE38B V2QE38C V2QE38D 
#       0       0       0       0 

rowSums(is.na(data))
# 1  2  3  4  5  6  7  8  9 10 
# 0  0  0  0  0  0  0  0  0  0

This is useful if you have a matrix and want to find where the NAs are.
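Since the sample data has no NAs, the following sketch uses a made-up data frame to show what the counts look like when some are present. Using which() with arr.ind = TRUE to list the exact positions is one possible way to locate them, not something the answer itself prescribes.

```r
# Made-up data with known NA positions (not the OP's data).
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

na_mat <- is.na(df)             # logical matrix, same shape as df

col_counts <- colSums(na_mat)   # NAs per column
row_counts <- rowSums(na_mat)   # NAs per row

col_counts
# a b 
# 1 2 

# Row/column index of every NA, one line per missing cell.
na_pos <- which(na_mat, arr.ind = TRUE)
nrow(na_pos)
# [1] 3
```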

Answer 1
1 Accepted 2021-10-20 20:26:18

Answer 2
1 2021-10-20 20:40:55