
How to summarise across different types of variables with dplyr::c_across()

I have data with different types of variables. Some are character, some factors, and some numeric, like below:

df <- data.frame(a = c("tt", "ss", "ss", NA), b=c(2,3,NA,1), c=c(1,2,NA, NA), d=c("tt", "ss", "ss", NA))

I'm trying to count the number of missing values per observation using c_across in dplyr. However, c_across doesn't seem to be able to combine different types of values, as the error message below suggests:

df %>%
  rowwise() %>%
  summarise(NAs = sum(is.na(c_across())))

Error: Problem with summarise() input NAs.
x Can't combine a <factor> and b.
ℹ Input NAs is sum(is.na(c_across())).
ℹ The error occurred in row 1.

Indeed, if I include only numeric variables, it works.

df %>%
  rowwise() %>%
  summarise(NAs = sum(is.na(c_across(b:c))))

The same is true if I include only character variables:

df %>%
  rowwise() %>%
  summarise(NAs = sum(is.na(c_across(c(a,d)))))

I could solve the issue without using c_across like below, but I have lots of variables, so it's not very practical.

df %>%
  rowwise() %>%
  summarise(NAs = is.na(a)+is.na(b)+is.na(c)+is.na(d))

I could use the traditional apply approach, like below, but I'd like to solve this using dplyr.

apply(df, 1, function(x)sum(is.na(x)))

Any suggestions as to how to compute the number of missing values row-wise, efficiently, and using dplyr?

I would suggest this approach. The issue has two parts. First, the variables in your data frame have different types, and c_across() cannot combine them. Second, it helps to have a key variable that identifies each row for the row-wise task. So in the code below we first convert the variables to a common type, then create an id from the row number. We pass that id to rowwise() and can then use c_across(). Here is the code (I have used your df data):

library(tidyverse)
#Code
df %>% 
  # funs() is deprecated, so pass as.character directly
  mutate_at(vars(everything()), as.character) %>%
  mutate(id=1:n()) %>%
  rowwise(id) %>%
  mutate(NAs = sum(is.na(c_across(a:d))))

Output:

# A tibble: 4 x 6
# Rowwise:  id
  a     b     c     d        id   NAs
  <chr> <chr> <chr> <chr> <int> <int>
1 tt    2     1     tt        1     0
2 ss    3     2     ss        2     0
3 ss    NA    NA    ss        3     2
4 NA    1     NA    NA        4     3

We can also avoid mutate_at() by using the newer across() inside mutate() to convert the variables to a common type:

#Code 2
df %>% 
  mutate(across(a:d,~as.character(.))) %>%
  mutate(id=1:n()) %>%
  rowwise(id) %>%
  mutate(NAs = sum(is.na(c_across(a:d))))

Output:

# A tibble: 4 x 6
# Rowwise:  id
  a     b     c     d        id   NAs
  <chr> <chr> <chr> <chr> <int> <int>
1 tt    2     1     tt        1     0
2 ss    3     2     ss        2     0
3 ss    NA    NA    ss        3     2
4 NA    1     NA    NA        4     3
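
As background on why the conversion to character is needed: according to the dplyr documentation, c_across() combines values with vctrs::vec_c(), which is stricter than base c() about mixing types. Below is a minimal sketch of that difference, assuming the vctrs package is available (it is installed as a dependency of dplyr). The variable names are only for illustration.

```r
library(vctrs)

# Base c() silently coerces the number to a character string
coerced <- c("tt", 2)

# vec_c() refuses to combine a character and a double, so we catch the error
strict <- tryCatch(
  vec_c("tt", 2),
  error = function(e) "cannot combine"
)

# Once both values share a type, vec_c() combines them without complaint
same_type <- vec_c("tt", as.character(2))
```

This is why converting every column with as.character() first lets c_across(a:d) run over the whole row.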

A much faster option is to skip rowwise and c_across altogether and use rowSums instead:

library(dplyr)
df %>% 
     mutate(NAs = rowSums(is.na(.)))
#     a  b  c    d NAs
#1   tt  2  1   tt   0
#2   ss  3  2   ss   0
#3   ss NA NA   ss   2
#4 <NA>  1 NA <NA>   3

If we want to count over only certain columns, e.g. the numeric ones:

df %>%
   mutate(NAs = rowSums(is.na(select(., where(is.numeric)))))
#     a  b  c    d NAs
#1   tt  2  1   tt   0
#2   ss  3  2   ss   0
#3   ss NA NA   ss   2
#4 <NA>  1 NA <NA>   1
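
The two rowSums() calls above can also be written with across() instead of the . pronoun, which may read more naturally in newer dplyr code. This is only a sketch of an alternative spelling, assuming dplyr 1.0.0 or later. The names all_cols and num_cols are made up for illustration.

```r
library(dplyr)

df <- data.frame(a = c("tt", "ss", "ss", NA), b = c(2, 3, NA, 1),
                 c = c(1, 2, NA, NA), d = c("tt", "ss", "ss", NA))

# across() returns a data frame of TRUE/FALSE flags, one column per input
# column; rowSums() then counts the TRUE values in each row
all_cols <- df %>%
  mutate(NAs = rowSums(across(everything(), is.na)))

# Same idea, restricted to the numeric columns b and c
num_cols <- df %>%
  mutate(NAs = rowSums(across(where(is.numeric), is.na)))
```

Because is.na() is applied column by column, the mixed column types never have to be combined, so no conversion to character is needed here.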
