Count NA in multiple columns in R
I'm trying to count the number of NAs in multiple columns of my data. Here is a reproducible sample.
dt <- structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0,
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
10L), class = "data.frame")
I tried two methods. The first one:
dt %>% select(starts_with("V2QE38")) %>% colSums(is.na(.))
This gives me some non-zero results (in short, it suggests I have NAs in some columns). Then I tried another one:
colSums(is.na(dt[,c("V2QE38A", "V2QE38B", "V2QE38C", "V2QE38D")]))
And I found no NA in any of these columns.
I think the second result is correct. But I'm wondering what I did wrong to get the first result. Thank you!
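In the first case, the dot appears only inside a nested call, is.na(.), so %>% still inserts the piped data as the first argument of colSums, and is.na(.) is passed along as a second argument. We can either block that by wrapping the expression in {}: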
library(dplyr)
dt %>%
select(starts_with("V2QE38")) %>%
{colSums(is.na(.))}
V2QE38A V2QE38B V2QE38C V2QE38D
0 0 0 0
or add another %>% step:
dt %>%
select(starts_with("V2QE38")) %>%
is.na %>%
colSums
-output
V2QE38A V2QE38B V2QE38C V2QE38D
0 0 0 0
The issue is that colSums is run on the piped data itself, so the is.na(.) result is not what gets summed:
> dt %>%
select(starts_with("V2QE38")) %>%
colSums(.)
V2QE38A V2QE38B V2QE38C V2QE38D
6 1 12 0
which is the same as the OP's output with colSums(is.na(.))
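The argument-insertion behaviour described above can be checked in isolation. This is only a small sketch, assuming the magrittr pipe (the same %>% that dplyr re-exports); show_args is a hypothetical helper, not part of any package, that reports how many arguments it received.

```r
library(magrittr)

# Hypothetical helper: returns the number of arguments it was called with.
show_args <- function(...) length(list(...))

# The dot appears only inside the nested call is.na(.), so the pipe
# also inserts the left-hand side as the first argument.
n_plain <- 1:3 %>% show_args(is.na(.))

# Braces turn the right-hand side into a plain expression,
# so nothing is inserted automatically.
n_braced <- 1:3 %>% { show_args(is.na(.)) }

n_plain   # 2
n_braced  # 1
```

Under this reading, the OP's first attempt behaves like colSums(., is.na(.)), which is why it returns the plain column sums.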
Base solution using sapply and an anonymous function function(x){sum(is.na(x))}:
data = structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0,
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
10L), class = "data.frame")
sapply(data, function(x){sum(is.na(x))})
# V2QE38A V2QE38B V2QE38C V2QE38D
# 0 0 0 0
sapply applies a function to each element of a list. A data.frame is a list, with each column vector being an item of this list. The s in sapply is for simplify, so sapply will try to convert the output list (from lapply) to a vector. If the required output is a list (it has some advantages), use lapply instead.
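To make the difference between the two concrete, here is a minimal sketch on a made-up data frame (df and its columns x and y are illustrative, not the OP's data) that does contain some NAs:

```r
# Illustrative data: x has one NA, y has two.
df <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 2))

# sapply simplifies the per-column results to a named integer vector.
na_vec <- sapply(df, function(col) sum(is.na(col)))

# lapply keeps the results as a named list, one element per column.
na_list <- lapply(df, function(col) sum(is.na(col)))

na_vec       # x = 1, y = 2
na_list$y    # 2
```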
is.na returns a logical (TRUE/FALSE) vector. This can be converted to a numeric vector with 1/0 values.
sum converts the TRUE/FALSE vector into a 1/0 vector and sums the values.
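The coercion can be seen step by step in a short sketch; the vector below is made up for illustration:

```r
# Illustrative vector with two missing values.
flags <- is.na(c(4, NA, 7, NA))

flags_int <- as.integer(flags)  # explicit TRUE/FALSE -> 1/0 conversion
n_missing <- sum(flags)         # sum does the same coercion implicitly

flags      # FALSE TRUE FALSE TRUE
flags_int  # 0 1 0 1
n_missing  # 2
```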
Alternatively, instead of treating the data.frame as a list, treat it as a matrix. Then the highly optimized rowSums and colSums can come into play.
colSums(is.na(data))
# V2QE38A V2QE38B V2QE38C V2QE38D
# 0 0 0 0
rowSums(is.na(data))
# 1 2 3 4 5 6 7 8 9 10
# 0 0 0 0 0 0 0 0 0 0
This is great if you have a matrix and want to find where the NAs are.
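As a rough sketch of locating the NAs rather than only counting them, base R's which with arr.ind = TRUE returns their row and column indices. The matrix m below is made up for illustration:

```r
# Illustrative 3 x 2 matrix, filled column by column:
# column 1 is 1, NA, 3 and column 2 is 4, 5, NA.
m <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3)

na_per_col <- colSums(is.na(m))  # NA count per column
na_per_row <- rowSums(is.na(m))  # NA count per row

# Row/column index of every NA, one row of output per missing cell.
na_pos <- which(is.na(m), arr.ind = TRUE)

na_per_col  # 1 1
na_per_row  # 0 1 1
na_pos      # (row 2, col 1) and (row 3, col 2)
```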