
compare sets of columns in R dataframe and keep one value from each set of two columns

Basically, I have a large dataset with many different variables. The variables come in pairs (2019 and 2020). For some variables no data is available in either year, for some only in 2019, and for some only in 2020. I would like the 2020 data to override the 2019 data whenever a 2020 value is available; if only the 2019 value is available, it should be kept. If no data is available for either year, the value should stay missing. I currently do this with a small helper function, but it needs to be more scalable so that I can apply it to 200+ column pairs. What am I missing in mutate(across(...))?


# Create data
library(dplyr)  # provides tibble(), mutate(), select() and %>%

mydf <- tibble(ID = 1:5,
               var1_2019 = c(9, NA, 3, 2, NA),
               var1_2020 = c(NA, NA, 3, 2, 4),
               var2_2019 = c("A", "B", NA, "D", "C"),
               var2_2020 = c(NA, "B", NA, "R", NA),
               var3_2019 = c(TRUE, FALSE, NA, NA, NA),
               var3_2020 = c(NA, NA, NA, NA, FALSE))

# Small helper function. Keeping it separate is useful because
# it could be made more complex in the future,
# for example by keeping the larger of the two values for numeric variables.
which_to_keep_f <-
  function(x, y) {
    if (is.na(x) && is.na(y)) {
      output <- NA
    }
    if (is.na(x) && !is.na(y)) {
      output <- y
    }
    if (!is.na(x) && is.na(y)) {
      output <- x
    }
    if (!is.na(x) && !is.na(y)) {
      output <- y
    }
    output
  }
# vectorize it
which_to_keep_f_vec <- Vectorize(which_to_keep_f)

# use function inside mutate

mydf %>% 
  mutate(var1 = which_to_keep_f_vec(var1_2019, var1_2020)) %>% 
  mutate(var2 = which_to_keep_f_vec(var2_2019, var2_2020)) %>% 
  mutate(var3 = which_to_keep_f_vec(var3_2019, var3_2020)) %>% 
  select(-contains("_20"))

Is this what you are looking for? Here we apply your function to each pair of columns, matching every _2019 column with its _2020 counterpart:

library(dplyr)
library(stringr)
# No list()/unnest() wrapper is needed: the vectorized helper already returns
# one value per row, and wrapping it repeats every ID once per row of the data.
# Each _2019 column is overwritten with the merged value.
mydf %>%
  mutate(across(ends_with('_2019'), 
                ~which_to_keep_f_vec(.,
                                     get(str_replace(cur_column(), "_2019$", "_2020")))))
     ID var1_2019 var1_2020 var2_2019 var2_2020 var3_2019 var3_2020
  <int>     <dbl>     <dbl> <chr>     <chr>     <lgl>     <lgl>    
1     1         9        NA A         NA        TRUE      NA       
2     2        NA        NA B         B         FALSE     NA       
3     3         3         3 NA        NA        NA        NA       
4     4         2         2 R         R         NA        NA       
5     5         4         4 C         NA        FALSE     FALSE    
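
If the goal is one merged column per pair (var1, var2, ...) instead of overwriting the _2019 columns, a sketch along the following lines may scale to many pairs. It rests on a few assumptions that are not in the question: every _2020 column has a matching _2019 column of the same type, and the Vectorize()d helper is swapped for dplyr::coalesce(), which returns the first non-missing value element-wise and so should give the same result as the helper on this data. The name merged and the glue expression passed to .names (used to strip the year suffix) are illustrative choices.

```r
library(dplyr)
library(stringr)

# Example data from the question
mydf <- tibble(ID = 1:5,
               var1_2019 = c(9, NA, 3, 2, NA),
               var1_2020 = c(NA, NA, 3, 2, 4),
               var2_2019 = c("A", "B", NA, "D", "C"),
               var2_2020 = c(NA, "B", NA, "R", NA),
               var3_2019 = c(TRUE, FALSE, NA, NA, NA),
               var3_2020 = c(NA, NA, NA, NA, FALSE))

merged <- mydf %>%
  mutate(across(
    ends_with("_2020"),
    # Take the 2020 value where present, otherwise fall back to the
    # matching 2019 column (looked up by name for the current column)
    ~ coalesce(.x, get(str_replace(cur_column(), "_2020$", "_2019"))),
    # Name each new column after the variable stem, e.g. var1_2020 -> var1
    .names = "{str_remove(.col, '_2020$')}"
  )) %>%
  # Drop the original year-specific columns
  select(-matches("_20(19|20)$"))

merged
```

For a numeric pair where the larger value should win instead, pmax(x, y, na.rm = TRUE) could take the place of coalesce(); it still returns NA when both inputs are missing.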

Here's an approach that results in just one variable for each pair of variables in your input table. First, use pivot_longer() to collapse the pairs into single variables, and add year as a column (with twice as many observations).

library(tidyr)  # pivot_longer() and fill(); dplyr is assumed to be loaded already
mydf_long = mydf %>%
  pivot_longer(cols = matches("_20"), names_to = c(".value", "year"),
               names_sep = "_")
mydf_long
      ID year   var1 var2  var3 
   <int> <chr> <dbl> <chr> <lgl>
 1     1 2019      9 A     TRUE 
 2     1 2020     NA NA    NA   
 3     2 2019     NA B     FALSE
 4     2 2020     NA B     NA   
 5     3 2019      3 NA    NA   
 6     3 2020      3 NA    NA   
 7     4 2019      2 D     NA   
 8     4 2020      2 R     NA   
 9     5 2019     NA C     NA   
10     5 2020      4 NA    FALSE

Next, use fill() to populate later NA values with earlier non-missing values. Then we can just filter to the most recent year (2020). For each variable, that year will have its own value if it had one before; otherwise, it will carry over the value from the previous year.

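mydf_long %>%
  group_by(ID) %>%
  fill(var1, var2, var3) %>%   # carry the 2019 value down into a missing 2020 value
  filter(year == "2020")       # year is a character column after pivot_longer()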
     ID year   var1 var2  var3 
  <int> <chr> <dbl> <chr> <lgl>
1     1 2020      9 A     TRUE 
2     2 2020     NA B     FALSE
3     3 2020      3 NA    NA   
4     4 2020      2 R     NA   
5     5 2020      4 C     FALSE
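
The two steps above can be combined into one self-contained pipeline. This is only a sketch under stated assumptions: all paired columns end in a four-digit year, the variable columns start with "var" (used here to select what to fill), and names_pattern is used in place of names_sep so that variable names containing underscores would still split correctly. The explicit arrange() guards against rows not already being ordered 2019 before 2020; the name result is illustrative.

```r
library(dplyr)
library(tidyr)

# Example data from the question
mydf <- tibble(ID = 1:5,
               var1_2019 = c(9, NA, 3, 2, NA),
               var1_2020 = c(NA, NA, 3, 2, 4),
               var2_2019 = c("A", "B", NA, "D", "C"),
               var2_2020 = c(NA, "B", NA, "R", NA),
               var3_2019 = c(TRUE, FALSE, NA, NA, NA),
               var3_2020 = c(NA, NA, NA, NA, FALSE))

result <- mydf %>%
  # One row per ID and year, one column per variable stem
  pivot_longer(cols = matches("_\\d{4}$"),
               names_to = c(".value", "year"),
               names_pattern = "^(.*)_(\\d{4})$") %>%
  # Make sure 2019 comes before 2020 within each ID
  arrange(ID, year) %>%
  group_by(ID) %>%
  # Fill a missing 2020 value with the 2019 value of the same ID
  fill(starts_with("var"), .direction = "down") %>%
  # Keep only the most recent year, then tidy up
  filter(year == "2020") %>%
  ungroup() %>%
  select(-year)

result
```

Because everything is selected by pattern, the same pipeline should apply unchanged to a table with many more pairs, as long as the naming assumptions hold.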


 