
Compare sets of columns in an R dataframe and keep one value from each set of two columns

Basically, I have a large dataset with many different variables. The data is ordered in pairs (2019 and 2020): for some variables no data is available for either year, for some only for 2019, and for some only for 2020. I would like the 2020 data to 'override' the 2019 data whenever a 2020 value is available; if only the 2019 value exists, it should be kept. If no data is available for either year, then the data should stay missing. I now do this with a little helper function, but this should be more scalable, so that I can do it for 200+ column pairs. What am I missing in mutate(across(....),)?


library(dplyr)  # provides tibble(), mutate(), select() and %>%

# Create data
mydf <- tibble(ID = 1:5,
               var1_2019 = c(9, NA, 3, 2, NA),
               var1_2020 = c(NA, NA, 3, 2, 4),
               var2_2019 = c("A", "B", NA, "D", "C"),
               var2_2020 = c(NA, "B", NA, "R", NA),
               var3_2019 = c(TRUE, FALSE, NA, NA, NA),
               var3_2020 = c(NA, NA, NA, NA, FALSE))

# create little helper function. this is good because
# it could be made more complex in the future, 
# for example for numeric variables keeping the larger of the two
which_to_keep_f <-
  function(x, y) {
    if (is.na(x) && is.na(y)) {
      output <- NA
    }
    if (is.na(x) && !is.na(y)) {
      output <- y
    }
    if (!is.na(x) && is.na(y)) {
      output <- x
    }
    if (!is.na(x) && !is.na(y)) {
      output <- y
    }
    output
  }
# vectorize it
which_to_keep_f_vec <- Vectorize(which_to_keep_f)

# use function inside mutate

mydf %>% 
  mutate(var1 = which_to_keep_f_vec(var1_2019, var1_2020)) %>% 
  mutate(var2 = which_to_keep_f_vec(var2_2019, var2_2020)) %>% 
  mutate(var3 = which_to_keep_f_vec(var3_2019, var3_2020)) %>% 
  select(-contains("_20"))

Is this what you are looking for? Here we apply your function to sets of pairs:

library(dplyr)
library(stringr)

# For each *_2019 column, look up its *_2020 partner by name and merge the pair.
# The merged value overwrites the *_2019 column; the *_2020 columns stay as they were.
# (Wrapping the result in list() and calling unnest() would repeat every row once per value.)
mydf %>%
  mutate(across(ends_with('_2019'),
                ~ which_to_keep_f_vec(.x,
                                      get(str_replace(cur_column(), "_2019$", "_2020")))))
     ID var1_2019 var1_2020 var2_2019 var2_2020 var3_2019 var3_2020
  <int>     <dbl>     <dbl> <chr>     <chr>     <lgl>     <lgl>    
1     1         9        NA A         NA        TRUE      NA       
2     2        NA        NA B         B         FALSE     NA       
3     3         3         3 NA        NA        NA        NA       
4     4         2         2 R         R         NA        NA       
5     5         4         4 C         NA        FALSE     FALSE    
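
As a possible variation on the pairwise step, the rule "take 2020, fall back to 2019" can be written with dplyr's coalesce(), which is already vectorized, so no Vectorize() wrapper is needed. The following is only a sketch under the assumption that every column to merge ends in _2019 and has a partner ending in _2020; the name result is purely illustrative.

```r
library(dplyr)
library(stringr)

# Same example data as in the question
mydf <- tibble(ID = 1:5,
               var1_2019 = c(9, NA, 3, 2, NA),
               var1_2020 = c(NA, NA, 3, 2, 4),
               var2_2019 = c("A", "B", NA, "D", "C"),
               var2_2020 = c(NA, "B", NA, "R", NA),
               var3_2019 = c(TRUE, FALSE, NA, NA, NA),
               var3_2020 = c(NA, NA, NA, NA, FALSE))

result <- mydf %>%
  # For each *_2019 column: use the 2020 value, fall back to 2019 where 2020 is NA
  mutate(across(ends_with("_2019"),
                ~ coalesce(get(str_replace(cur_column(), "_2019$", "_2020")), .x))) %>%
  # The merged values now sit in the *_2019 columns; drop the 2020 half of each pair
  select(-ends_with("_2020")) %>%
  # Strip the year suffix so each pair is left as a single column
  rename_with(~ str_remove(.x, "_2019$"), ends_with("_2019"))

result
```

If a different rule is needed later (for example keeping the larger of two numeric values), the coalesce() call is the only part that would have to change.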

Here's an approach that results in just one variable for each pair of variables in your input table. First, use pivot_longer() to collapse the pairs into single variables, and add year as a column (with twice as many observations).

library(dplyr)
library(tidyr)  # pivot_longer() and fill()
mydf_long <- mydf %>%
  pivot_longer(cols = matches("_20"), names_to = c(".value", "year"),
               names_sep = "_")
      ID year   var1 var2  var3 
   <int> <chr> <dbl> <chr> <lgl>
 1     1 2019      9 A     TRUE 
 2     1 2020     NA NA    NA   
 3     2 2019     NA B     FALSE
 4     2 2020     NA B     NA   
 5     3 2019      3 NA    NA   
 6     3 2020      3 NA    NA   
 7     4 2019      2 D     NA   
 8     4 2020      2 R     NA   
 9     5 2019     NA C     NA   
10     5 2020      4 NA    FALSE

Next, use fill() to populate later NA values with earlier non-missing values. Then we can just filter to the most recent year (2020). For each variable, that year keeps its own value if it has one; otherwise, it carries over the value from the previous year.

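mydf_long %>%
  group_by(ID) %>%
  fill(var1, var2, var3) %>%   # within each ID, carry the 2019 value down into a missing 2020 value
  filter(year == "2020") %>%   # year is a character column after pivot_longer()
  ungroup()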
     ID year   var1 var2  var3 
  <int> <chr> <dbl> <chr> <lgl>
1     1 2020      9 A     TRUE 
2     2 2020     NA B     FALSE
3     3 2020      3 NA    NA   
4     4 2020      2 R     NA   
5     5 2020      4 C     FALSE
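
For 200+ column pairs, listing every variable inside fill() gets tedious. The pivot-and-fill steps above could be wrapped in a small function along these lines. This is a sketch, not a tested general solution: collapse_pairs and its id_col argument are hypothetical names, and it assumes every paired column ends in an underscore followed by a four-digit year. It uses names_pattern instead of names_sep so that variable names containing their own underscores would still split correctly.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
mydf <- tibble(ID = 1:5,
               var1_2019 = c(9, NA, 3, 2, NA),
               var1_2020 = c(NA, NA, 3, 2, 4),
               var2_2019 = c("A", "B", NA, "D", "C"),
               var2_2020 = c(NA, "B", NA, "R", NA),
               var3_2019 = c(TRUE, FALSE, NA, NA, NA),
               var3_2020 = c(NA, NA, NA, NA, FALSE))

# Hypothetical helper: collapse every <name>_<year> pair into one <name> column,
# preferring the latest year and falling back to earlier years.
collapse_pairs <- function(df, id_col = "ID") {
  long <- df %>%
    pivot_longer(cols = matches("_[0-9]{4}$"),
                 names_to = c(".value", "year"),
                 names_pattern = "^(.*)_([0-9]{4})$")

  # Everything that is neither the ID nor the year is a value column to fill
  value_cols <- setdiff(names(long), c(id_col, "year"))

  long %>%
    arrange(.data[[id_col]], year) %>%              # earliest year first within each ID
    group_by(across(all_of(id_col))) %>%
    fill(all_of(value_cols), .direction = "down") %>%
    slice_tail(n = 1) %>%                           # keep the latest year per ID
    ungroup() %>%
    select(-year)
}

collapsed <- collapse_pairs(mydf)
collapsed
```

Taking the last row per ID rather than filtering on a fixed year means the same function would also work if more years were added later, as long as the year suffixes sort correctly as text.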

