How do I create a variable that tells me which of a number of other variables is the first one to not have a missing value for one observation?

Question

If I have the following data structure in my data frame df:

a  b  c  d

1  2  3  4
NA NA 1  2
NA 1  2  NA
NA NA NA 1

how can I create a variable that tells me, which of the variables is the first one to not have a missing value, such that:

a  b  c  d  var

1  2  3  4  a
NA NA 1  2  c
NA 1  2  NA b
NA NA NA 1  d

I need the code to work with variable names and not column numbers, because of the size of the dataset and changing the order of the variables.

I have tried:

df <- df %>% mutate(var = coalesce(deparse(substitute(a)), deparse(substitute(b)), deparse(substitute(c)), deparse(substitute(d))))

and

df <- df %>% mutate(var = deparse(substitute(do.call(coalesce, across(c(a, b, c, d))))))

trying to implement this approach. I got the code to extract the string of a variable name from: How to convert variable (object) name into String

Answer 1

You can do

df %>% mutate(var = apply(., 1, \(x) names(which(!is.na(x)))[1]))
#>    a  b  c  d var
#> 1  1  2  3  4   a
#> 2 NA NA  1  2   c
#> 3 NA  1  2 NA   b
#> 4 NA NA NA  1   d

Answer 2

We can use max.col , ie

names(df)[max.col(!is.na(df), ties.method = 'first')]
#[1] "a" "c" "b" "d"

Via dplyr ,

library(dplyr)

df %>% 
 mutate(var = names(.)[max.col(!is.na(.), ties.method = 'first')])

   a  b  c  d var
1  1  2  3  4   a
2 NA NA  1  2   c
3 NA  1  2 NA   b
4 NA NA NA  1   d

You can specify variables

df %>% 
 mutate(var = names(.[c('a', 'b')])[max.col(!is.na(.[c('a', 'b')]), ties.method = 'first')])
   a  b  c  d var
1  1  2  3  4   a
2 NA NA  1  2   a
3 NA  1  2 NA   b
4 NA NA NA  1   a

You can also change the order of the variables via select() , ie

df %>% 
 select(c, d, b, a) %>%
 mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')])

   c  d  b  a new
1  3  4  2  1   c
2  1  2 NA NA   c
3  2 NA  1 NA   c
4 NA  1 NA NA   d

You can also call select() again to restore the original order of columns but maintain the result from previous order, ie

df %>% 
 select(c, d, b, a) %>%
 mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')]) %>% 
 select(names(df), new)

   a  b  c  d new
1  1  2  3  4   c
2 NA NA  1  2   c
3 NA  1  2 NA   c
4 NA NA NA  1   d

To maintain all the variables at the end, you can join on the original data frame, ie

df %>% 
 select(c, d, b) %>%
 mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')]) %>% 
 left_join(df) %>% 
 select(names(df), new)

Joining, by = c("c", "d", "b")
   a  b  c  d new
1  1  2  3  4   c
2 NA NA  1  2   c
3 NA  1  2 NA   c
4 NA NA NA  1   d

How do I create a variable that tells me which of a number of other variables is the first one to not have a missing value for one observation?

Question

2 answers

solution1
1 2022-05-20 09:36:46

solution2
1 ACCPTED 2022-05-20 09:37:41

How do I create a variable that tells me which of a number of other variables is the first one to not have a missing value for one observation?

Question

2 answers

solution1 1 2022-05-20 09:36:46

solution2 1 ACCPTED 2022-05-20 09:37:41

solution1
1 2022-05-20 09:36:46

solution2
1 ACCPTED 2022-05-20 09:37:41