Merge two R dataframes by at least one merge ID across columns

I have a joining problem that I'm struggling with: the join IDs I want to use for two separate dataframes are spread across three possible ID columns. I'd like to be able to join whenever at least one join ID matches. I know the *_join and merge functions accept a vector of column names, but is it possible to make this work conditionally?

For example, if I have the following two data frames:

df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
                   id1 = c("abc", "", "bcd"),
                   id2 = c("", "", "xyz"),
                   id3 = c("def", "fgh", ""), stringsAsFactors = FALSE)


df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
                   id1 = c("abc", "", ""),
                   id2 = c("", "xyz", "zzz"),
                   id3 = c("", "", ""), stringsAsFactors = FALSE)


> df_A
  dta id1 id2 id3
1 FOO abc     def
2 BAR         fgh
3 GOO bcd xyz   

> df_B
  dta id1 id2 id3
1 FUU abc        
2 PAR     xyz    
3 KOO     zzz  

I hope to end up with something like this:

  dta.x dta.y id1  id2  id3
1 FOO   FUU   abc  ""   def   [matched on id1]
2 BAR   ""    ""   ""   fgh   [unmatched]
3 GOO   PAR   bcd  xyz  ""    [matched on id2]
4 KOO   ""    ""   zzz  ""    [unmatched]

So unmatched dta values from either data frame are retained, but where there is a match (rows 1 and 3 above) the dta values from both data frames are joined in one row of the new table. I have a sense that none of _join, merge, or match will work as is and that I'd need to write a function, but I'm not sure where to start. Any help or ideas appreciated. Thank you.

Basically, you want to join on corresponding IDs. One way to do that is to reshape the original id columns into long format, with one id_column (the name of the ID column) and one id_value (its value). Because you don't want to join on "", I dropped those rows.

library(tidyverse)
df_A_long <- df_A %>%
    pivot_longer(
        cols = -dta,
        names_to = "id_column",
        values_to = "id_value"
    ) %>%
    dplyr::filter(id_value != "")


df_B_long <- df_B %>%
    pivot_longer(
        cols = -dta,
        names_to = "id_column",
        values_to = "id_value"
    ) %>%
    dplyr::filter(id_value != "")

We then always join A and B on both id_column and id_value.

> df_B_long
# A tibble: 3 x 3
  dta   id_column id_value
  <chr> <chr>     <chr>   
1 FUU   id1       abc     
2 PAR   id2       xyz     
3 KOO   id2       zzz 
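
To see why the empty strings are dropped before joining, here is a minimal sketch that compares the join with and without the filter. The helper `to_long()` and its `drop_empty` argument are hypothetical names introduced only for this illustration; the data frames are the ones from the question.

```r
library(dplyr)
library(tidyr)

df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
                   id1 = c("abc", "", "bcd"),
                   id2 = c("", "", "xyz"),
                   id3 = c("def", "fgh", ""), stringsAsFactors = FALSE)
df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
                   id1 = c("abc", "", ""),
                   id2 = c("", "xyz", "zzz"),
                   id3 = c("", "", ""), stringsAsFactors = FALSE)

# Hypothetical helper: reshape the id columns to long format,
# optionally dropping rows whose ID is the empty string.
to_long <- function(df, drop_empty = TRUE) {
  long <- pivot_longer(df, cols = -dta,
                       names_to = "id_column", values_to = "id_value")
  if (drop_empty) long <- filter(long, id_value != "")
  long
}

# Without the filter, "" in A matches "" in B within the same column.
# Newer dplyr versions may warn about a many-to-many match here.
spurious <- suppressWarnings(
  inner_join(to_long(df_A, drop_empty = FALSE),
             to_long(df_B, drop_empty = FALSE),
             by = c("id_column", "id_value"), suffix = c("_A", "_B"))
)

# With the filter, only genuine ID matches remain.
matches <- inner_join(to_long(df_A), to_long(df_B),
                      by = c("id_column", "id_value"), suffix = c("_A", "_B"))

nrow(spurious)  # 9
nrow(matches)   # 2
matches
```

The `id_column` column of `matches` also records which ID column produced each match (id1 for FOO/FUU, id2 for GOO/PAR), which corresponds to the "[matched on ...]" notes in the question.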

The joining part is clear, but to create your desired output we need to do some data wrangling to make it look identical.

df_joined <- df_A_long %>%
    # join using id_column and id_value
    full_join(df_B_long, by = c("id_column","id_value"),suffix = c("1","2")) %>%
    # pivot back to wide format
    pivot_wider(
        id_cols = c(dta1,dta2),
        names_from = id_column,
        values_from = id_value
    ) %>%
    # if dta1 is missing, then in the same row, move value from dta2 to dta1
    mutate(
        dta1_has_value = !is.na(dta1), # helper column
        dta1 = ifelse(dta1_has_value,dta1,dta2),
        dta2 = ifelse(!dta1_has_value & !is.na(dta2),NA,dta2)
    ) %>%
    select(-dta1_has_value) %>%
    group_by(dta1) %>%
    # condense multiple rows into one row
    summarise_all(
        ~ifelse(all(is.na(.x)),"",.x[!is.na(.x)])
    ) %>%
    # reorder columns
    {
        .[sort(colnames(.))]
    }

Result:

> df_joined
# A tibble: 4 x 5
  dta1  dta2  id1   id2   id3  
  <chr> <chr> <chr> <chr> <chr>
1 BAR   ""    ""    ""    fgh  
2 FOO   FUU   abc   ""    def  
3 GOO   PAR   bcd   xyz   ""   
4 KOO   ""    ""    zzz   ""   

Another approach uses sqldf:

library(sqldf)
one <- 
  sqldf('
    select  a.*
            , b.dta as dta_b
    from    df_A a
            left join df_B b
              on  (a.id1 <> "" and a.id1 = b.id1)
                  or (a.id2 <> "" and a.id2 = b.id2)
                  or (a.id3 <> "" and a.id3 = b.id3)

  ')

two <- 
  sqldf('
    select  b.*
    from    df_B b
            left join one
              on  b.dta = one.dta
                  or b.dta = one.dta_b
    where   one.dta is null
  ')

dplyr::bind_rows(one, two)
#   dta id1 id2 id3 dta_b
# 1 FOO abc     def   FUU
# 2 BAR         fgh  <NA>
# 3 GOO bcd xyz       PAR
# 4 KOO     zzz      <NA>
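
For comparison, the same "match on at least one ID" idea can be sketched in base R with match(), whose incomparables argument keeps "" from ever matching. This is only a rough sketch under stated assumptions: it takes the first matching row of df_B for each row of df_A, and the names `b_row`, `matched_on`, and `result` are introduced here for illustration, not taken from the answers above.

```r
df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
                   id1 = c("abc", "", "bcd"),
                   id2 = c("", "", "xyz"),
                   id3 = c("def", "fgh", ""), stringsAsFactors = FALSE)
df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
                   id1 = c("abc", "", ""),
                   id2 = c("", "xyz", "zzz"),
                   id3 = c("", "", ""), stringsAsFactors = FALSE)

id_cols <- c("id1", "id2", "id3")

# For each row of df_A: index of the matching row in df_B (NA if none),
# and the ID column that produced the match.
b_row <- rep(NA_integer_, nrow(df_A))
matched_on <- rep(NA_character_, nrow(df_A))
for (col in id_cols) {
  # incomparables = "" means an empty ID in df_A never matches anything
  hit <- match(df_A[[col]], df_B[[col]], incomparables = "")
  newly <- is.na(b_row) & !is.na(hit)
  b_row[newly] <- hit[newly]
  matched_on[newly] <- col
}

# Rows of df_A, with the matched df_B value where there is one
matched <- data.frame(dta_A = df_A$dta,
                      dta_B = df_B$dta[b_row],
                      matched_on = matched_on,
                      stringsAsFactors = FALSE)

# Rows of df_B that no row of df_A matched
unmatched_B <- setdiff(seq_len(nrow(df_B)), b_row)
leftover <- data.frame(dta_A = rep(NA_character_, length(unmatched_B)),
                       dta_B = df_B$dta[unmatched_B],
                       matched_on = rep(NA_character_, length(unmatched_B)),
                       stringsAsFactors = FALSE)

result <- rbind(matched, leftover)
result
```

Unlike the desired output in the question, this sketch keeps the unmatched df_B value (KOO) in its own dta_B column rather than moving it into the first column, and it does not carry the id columns along; those could be added back by indexing df_A and df_B with the same row indices.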
