Compare multiple pairs of x/y columns after left join and if different use y in R

I have a data.frame df1. Some selected rows have been manually reviewed and updated, creating a second data.frame df1_update, which has all the same columns (with some of the data changed) plus additional columns.

I want to join the updated version to the original so that: where the data has been changed, the updated value replaces the original; where there has been no change, the original is retained; and where the data has not been reviewed (i.e. the row is not in df1_update), the original is retained.

I have done this in this small example as follows:

library(lubridate)
library(dplyr)
library(tidyr)
df1 =  data.frame(id = c(1,2,3,4,5),
                  date = dmy(c("15/01/2020", "03/12/2020", "20/08/2019" , "01/01/2021", "01/02/2021")),
                  type = c("type_A","type_A", "type_B", "type_C", "type_B"))


df1_update = data.frame(id = c(1,2,3),
                 date = dmy(c("25/01/2020", "03/12/2020", "20/08/2019")),
                 type = c("type_A","type_B", "type_B"),
                 new_info = c("note", "nil","note"))

df3 = left_join(df1, df1_update, by = "id")%>%
  mutate(date = case_when(is.na(date.y) ~ date.x, 
                          date.x == date.y ~ date.x,
                          date.x != date.y ~ date.y),
         type = case_when(is.na(type.y) ~type.x,
                          type.x == type.y ~ type.x,
                          type.x != type.y ~ type.y))%>%
  select(-contains(c(".x", ".y"))) 

df3

> df3
  id new_info       date   type
1  1     note 2020-01-25 type_A
2  2      nil 2020-12-03 type_B
3  3     note 2019-08-20 type_B
4  4     <NA> 2021-01-01 type_C
5  5     <NA> 2021-02-01 type_B

In my real data set I have around 16 columns that have been reviewed and updated. Is it possible to compare all pairs of columns ending in .x and .y without having to name each pair as I have above? I'm guessing it may be possible by writing a function.

Here are examples of real column names, after left_join:

"Access_ID" "First_use_Date_After_Creation.x" "Last_use_Date"
[13] "StatusOnAccessDay.x"
[43] "Access_Type.y" "Access_Site.y" "StatusOnAccessDay.y"
[46] "Date_Construction.y" "Date_Of_First_Use.y" "First_use_Date_After_Creation.y"
[49] "Date_Of_failure.y" "Date_Of_removal.y"
[52] "Problem_item" "Problem_item.1"

It may be easier with coalesce (if there are not many conditions; otherwise case_when can be used). In addition, assuming that there is always a .y column for each corresponding .x column, loop across the .x columns, replace the .x suffix of the column name ( cur_column() ) with .y, get the value of that column, apply case_when, update the column name with .names, and remove the unused columns using .keep.

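library(dplyr)
library(stringr)
left_join(df1, df1_update, by = "id") %>% 
   mutate(across(ends_with('.x'), 
   ~ {
      # compare as character so every pair of columns has a common type
      xdat <- as.character(.x)
      # look up the matching .y column; '[.]x$' only matches a trailing ".x"
      ydat <- as.character(get(str_replace(cur_column(), '[.]x$', '.y')))
      case_when(is.na(ydat) ~ xdat, 
                xdat == ydat ~ xdat,
                xdat != ydat ~ ydat)
     }, 
      # name each result after its .x column, minus the suffix
      .names = "{str_remove(.col, '[.]x$')}"), .keep = 'unused') %>%
   # restores numeric/logical types; dates stay as character strings
   type.convert(as.is = TRUE)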

-output

 id new_info       date   type
1  1     note 2020-01-25 type_A
2  2      nil 2020-12-03 type_B
3  3     note 2019-08-20 type_B
4  4     <NA> 2021-01-01 type_C
5  5     <NA> 2021-02-01 type_B
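
The coalesce route mentioned above can be sketched as follows. This is a minimal sketch, not the answerer's code: it assumes that an NA in a .y column always means "not reviewed" (so an NA never overwrites an original value), and the names joined, bases and df3_coalesce are made up for illustration. The toy data is rebuilt with base as.Date() so the block runs on its own.

```r
library(dplyr)

# Same toy data as in the question, built with base as.Date()
df1 <- data.frame(
  id   = c(1, 2, 3, 4, 5),
  date = as.Date(c("2020-01-15", "2020-12-03", "2019-08-20",
                   "2021-01-01", "2021-02-01")),
  type = c("type_A", "type_A", "type_B", "type_C", "type_B")
)
df1_update <- data.frame(
  id       = c(1, 2, 3),
  date     = as.Date(c("2020-01-25", "2020-12-03", "2019-08-20")),
  type     = c("type_A", "type_B", "type_B"),
  new_info = c("note", "nil", "note")
)

joined <- left_join(df1, df1_update, by = "id")

# Base names that occur with both suffixes, here "date" and "type"
x_cols <- grep("[.]x$", names(joined), value = TRUE)
bases  <- sub("[.]x$", "", x_cols)
bases  <- bases[paste0(bases, ".y") %in% names(joined)]

# Prefer the reviewed (.y) value; fall back to the original (.x) when .y is NA
for (b in bases) {
  joined[[b]] <- coalesce(joined[[paste0(b, ".y")]], joined[[paste0(b, ".x")]])
}

# Drop the suffixed helper columns
df3_coalesce <- select(joined, -ends_with(c(".x", ".y")))
df3_coalesce
```

Because coalesce works on the original column types, the date column stays a Date here, whereas comparing through as.character() returns it as character. If reviewers can deliberately blank out a value, this sketch would not pick that up, since the NA falls back to the original.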

Another way with a function:

library(dplyr)
library(purrr)

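# For one base name (e.g. "date"), build the column `base` from the pair
# base.x / base.y, preferring the .y value when it is present and different
coalesce_from_base <- function(df, base) {
  
  .x <- paste0(base, ".x")
  .y <- paste0(base, ".y")
  
  # `base` is a string, so unquote it directly on the left-hand side of :=
  df %>% 
    mutate(!!base := case_when(is.na(.data[[.y]]) ~ .data[[.x]], 
                               .data[[.x]] == .data[[.y]] ~ .data[[.x]],
                               .data[[.x]] != .data[[.y]] ~ .data[[.y]])) 
  
}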

# join together
df3 <- left_join(df1, df1_update, by = "id")

# create a vector of base names to iterate over
col_base <- c("date", "type")
# or derive it from the joined column names, dropping the .x suffix:
# col_base <- stringr::str_remove(stringr::str_subset(names(df3), "\\.x$"), "\\.x$")

# use reduce to apply the function cumulatively, once per base name
reduce(col_base, coalesce_from_base, .init = df3) %>%
  select(-ends_with(c(".x", ".y"))) 
