
Count NA in multiple columns in R

I'm trying to count the number of NA values in multiple columns of my data. Here is a reproducible sample.

dt <- structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0, 
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0, 
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
10L), class = "data.frame")

I tried two methods. The first one:

dt %>% select(starts_with("V2QE38")) %>% colSums(is.na(.))

This gives me some nonzero results (in short, it suggests I have NAs in some columns). Then I tried another one:

colSums(is.na(dt[,c("V2QE38A", "V2QE38B", "V2QE38C", "V2QE38D")]))

And I found no NA in any of these columns.

I think the second result is correct, but what did I do wrong to get the first result? Thank you!

In the first case, colSums receives more than the is.na(.) result, because %>% also inserts the left-hand side as its first argument. We can either wrap the call in {}

library(dplyr)
dt %>% 
    select(starts_with("V2QE38")) %>%
    {colSums(is.na(.))}
V2QE38A V2QE38B V2QE38C V2QE38D 
      0       0       0       0 

or chain each function with its own %>%

dt %>%
    select(starts_with("V2QE38")) %>%
    is.na %>%
    colSums

Output:

V2QE38A V2QE38B V2QE38C V2QE38D 
      0       0       0       0 

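The issue is that the dot appears only inside the nested call is.na(.), so %>% still places the data as the first argument and the call is evaluated as colSums(., is.na(.)). The data itself is summed, and the is.na(.) result lands in the second argument (na.rm) instead of being the thing that is summed. With no NA in these columns, that is the same as the plain column sums: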

dt %>% 
   select(starts_with("V2QE38")) %>% 
   colSums(.)
V2QE38A V2QE38B V2QE38C V2QE38D 
      6       1      12       0 

which is the same as the OP's output with colSums(is.na(.)).
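To see where the first argument goes, here is a minimal sketch. It assumes the magrittr package (the source of the %>% that dplyr re-exports) is installed. count_args is a hypothetical helper introduced only for this illustration, and dt_demo is made-up data with no NA values.

```r
library(magrittr)

# Hypothetical helper: reports how many arguments it was called with.
count_args <- function(...) length(list(...))

# Made-up data with no missing values.
dt_demo <- data.frame(a = c(1, 0, 1), b = c(2, 2, 0))

# The dot sits only inside a nested call, so %>% also passes dt_demo
# as the first argument: count_args(dt_demo, is.na(dt_demo)).
n_plain <- dt_demo %>% count_args(is.na(.))

# Braces suppress the automatic first argument: count_args(is.na(dt_demo)).
n_braced <- dt_demo %>% {count_args(is.na(.))}

# The piped colSums call therefore matches the explicit two-argument call,
# where the logical matrix ends up in the na.rm position.
piped <- dt_demo %>% colSums(is.na(.))
explicit <- colSums(dt_demo, is.na(dt_demo))
```

Here n_plain is 2 and n_braced is 1, and piped equals explicit, which is why the unbraced version returns column sums rather than NA counts.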

Base R solution using sapply and an anonymous function, function(x){sum(is.na(x))}:

data = structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0, 
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0, 
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
10L), class = "data.frame")

sapply(data, function(x){sum(is.na(x))})
# V2QE38A V2QE38B V2QE38C V2QE38D 
#       0       0       0       0 

Explanation:

sapply applies a function to each element of a list. A data.frame is a list in which each column vector is one element. The s in sapply stands for simplify, so sapply tries to convert the output list (as returned by lapply) to a vector. If you need the output as a list (which has some advantages), use lapply instead.
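The difference between the two can be sketched on a small made-up data frame (df_demo and its values are assumptions for illustration, not the OP's data):

```r
# Made-up data with a few missing values.
df_demo <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))

# lapply always returns a list, one element per column.
na_list <- lapply(df_demo, function(col) sum(is.na(col)))

# sapply simplifies that list to a named vector when it can.
na_vec <- sapply(df_demo, function(col) sum(is.na(col)))
```

Both hold the same counts (1 for x, 2 for y); only the container differs.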

is.na returns a logical (TRUE/FALSE) vector, which can be converted to a numeric vector of 1/0 values.

sum coerces the TRUE/FALSE vector to 1/0 and adds up the values, so the result is the number of NA entries.
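A small sketch of those two steps on a made-up vector (x is an assumption for illustration):

```r
# Made-up vector with three missing values.
x <- c(4, NA, 7, NA, NA)

flags <- is.na(x)                # logical: FALSE TRUE FALSE TRUE TRUE
as_numbers <- as.numeric(flags)  # numeric: 0 1 0 1 1
n_missing <- sum(flags)          # TRUE counts as 1, so this is 3
```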

Alternative solutions:

Alternatively, instead of treating the data.frame as a list, treat it as a matrix: is.na applied to a data.frame returns a logical matrix, so the highly optimized rowSums and colSums can be used.

colSums(is.na(data))
# V2QE38A V2QE38B V2QE38C V2QE38D 
#       0       0       0       0 

rowSums(is.na(data))
# 1  2  3  4  5  6  7  8  9 10 
# 0  0  0  0  0  0  0  0  0  0

This is useful when you have a matrix and want to find where the NAs are.
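For locating individual NA cells, one option is base R's which with arr.ind = TRUE. This is a sketch on a made-up matrix (m is an assumption for illustration), not part of the answer above:

```r
# Made-up 2 x 3 matrix with two missing cells.
m <- matrix(c(1, NA, 3,
              4, 5, NA), nrow = 2, byrow = TRUE)

# Counts per column and per row.
na_per_col <- colSums(is.na(m))
na_per_row <- rowSums(is.na(m))

# Row and column index of every NA cell, one cell per output row.
na_pos <- which(is.na(m), arr.ind = TRUE)
```

na_pos has the columns row and col, so each of its rows identifies one missing cell: here (1, 2) and (2, 3).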
