
How can I match multiple data frames with multiple IDs in R?

I have the following data frames, as an example:

df1 <- read.table(text = "
ID1 speed ID2 Time ID3 Income
4   60    5   100  3   300
3   80    2   90   7   400
2   90    6   100  6   600
", header = TRUE)

df2 <- read.table(text = "
ID Colour CA NA  DC NO
2  Y      Y  N12 A  B-12
3  B      N  M18 B  B-17
6  R      Y  M20 E  B-22
4  P      N  M22 F  B-27
7  B      Y  M11 G  B-32
", header = TRUE)

The expected outcome is:

ID1 speed1 Colour1 CA1 NA1 DC1 NO1  ID2 Time Colour2 CA2 NA2 DC2 NO2 ID3 Income Colour3 CA3 NA3 DC3 NO3
4   60     P       N   M22 F   B-27 5   100  NA      xx  xx  xx  xx  3   300    xx      xx  xx  xx  xx
3   80     B       N   M18 B   B-17 2   90   Y       xx  xx  xx  xx  7   400    xx      xx  xx  xx  xx
2   90     Y       Y   N12 A   B-12 6   100  Y       xx  xx  xx  xx  6   600    xx      xx  xx  xx  xx

Judging from the input and the expected output, each 'ID' column of 'df1' ('ID1', 'ID2', 'ID3') needs its own join against 'ID' in 'df2'. Get the names of those ID columns ('nm1') and the names of the 'df2' columns other than 'ID' ('nm2'). Then loop over the ID columns and, for each one, join 'df2' on 'ID' against the corresponding 'ID1', 'ID2' or 'ID3' and assign ( := ) the 'nm2' columns to 'df3', with the loop index appended to their names. Because read.table renames the 'NA' column to 'NA.', the new columns come out as 'NA.1', 'NA.2' and 'NA.3'.

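library(data.table)

# work on a copy so that df1 itself stays a plain data.frame
df3 <- copy(df1)
setDT(df3)

# ID columns of df1 (ID1, ID2, ID3) and the df2 columns to bring over
nm1 <- grep("^ID\\d+$", names(df1), value = TRUE)
nm2 <- setdiff(setdiff(names(df2), names(df1)), "ID")

for (i in seq_along(nm1)) {
  # join df2 on ID against the i-th ID column; add the nm2 columns with suffix i
  df3[df2, paste0(nm2, i) := mget(nm2), on = setNames("ID", nm1[i])]
}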

-output

df3
   ID1 speed ID2 Time ID3 Income Colour1 CA1 NA.1 DC1  NO1 Colour2  CA2 NA.2  DC2  NO2 Colour3 CA3 NA.3 DC3  NO3
1:   4    60   5  100   3    300       P   N  M22   F B-27    <NA> <NA> <NA> <NA> <NA>       B   N  M18   B B-17
2:   3    80   2   90   7    400       B   N  M18   B B-17       Y    Y  N12    A B-12       B   Y  M11   G B-32
3:   2    90   6  100   6    600       Y   Y  N12   A B-12       R    Y  M20    E B-22       R   Y  M20   E B-22
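
For readers without data.table, the same per-column lookup can be sketched in base R with match(). This is only an illustration under a few assumptions: the names id_cols, value_cols and res are made up here, the sample data is the question's with Colour and CA as separate columns (as in the output above), and each ID is assumed to occur at most once in df2, since match() returns only the first hit.

```r
# Sample data from the question; read.table() renames the NA column to "NA."
df1 <- read.table(text = "
ID1 speed ID2 Time ID3 Income
4 60 5 100 3 300
3 80 2 90 7 400
2 90 6 100 6 600
", header = TRUE)

df2 <- read.table(text = "
ID Colour CA NA DC NO
2 Y Y N12 A B-12
3 B N M18 B B-17
6 R Y M20 E B-22
4 P N M22 F B-27
7 B Y M11 G B-32
", header = TRUE)

# Hypothetical helper names, not part of the original answer
id_cols    <- grep("^ID\\d+$", names(df1), value = TRUE)  # ID1, ID2, ID3
value_cols <- setdiff(names(df2), "ID")                   # columns to copy over

res <- df1
for (i in seq_along(id_cols)) {
  # position in df2 of each ID from the i-th ID column (NA if there is no match)
  idx <- match(df1[[id_cols[i]]], df2$ID)
  # copy the matched rows over, suffixing the column names with i
  res[paste0(value_cols, i)] <- df2[idx, value_cols]
}
res
```

IDs that do not occur in df2 (such as 5 in ID2) simply yield NA in all of the added columns for that row.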

Another option is to reshape to 'long' format with pivot_longer, join with left_join, and then reshape back to 'wide' format with pivot_wider.

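library(dplyr)
library(tidyr)
library(readr)

df1 %>%
  # row id, so the original rows can be rebuilt after the join
  mutate(rn = row_number()) %>%
  # stack ID1, ID2, ID3 into a 'name' column and a single 'ID' column
  pivot_longer(cols = starts_with("ID"), values_to = "ID") %>%
  left_join(df2, by = "ID") %>%
  # "ID1" -> 1, "ID2" -> 2, ...; used as the suffix of the new column names
  mutate(name = parse_number(name)) %>%
  pivot_wider(names_from = name, values_from = ID:NO, names_sep = "") %>%
  select(-rn)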

-output

# A tibble: 3 x 21
  speed  Time Income   ID1   ID2   ID3 Colour1 Colour2 Colour3 CA1   CA2   CA3   NA.1  NA.2  NA.3  DC1   DC2   DC3   NO1   NO2   NO3  
  <int> <int>  <int> <int> <int> <int> <chr>   <chr>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1    60   100    300     4     5     3 P       <NA>    B       N     <NA>  N     M22   <NA>  M18   F     <NA>  B     B-27  <NA>  B-17 
2    80    90    400     3     2     7 B       Y       B       N     Y     Y     M18   N12   M11   B     A     G     B-17  B-12  B-32 
3    90   100    600     2     6     6 Y       R       R       Y     Y     Y     N12   M20   M20   A     E     E     B-12  B-22  B-22 
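
The long/join/wide idea can also be followed step by step in base R, which makes the intermediate tables easy to inspect. The following is a rough sketch rather than a replacement for the pipeline above: long, joined, wide and out are hypothetical names, the sample data is assumed to have Colour and CA as separate columns, and the result has the same 21 columns as the tibble but in a different order.

```r
# Sample data from the question; read.table() renames the NA column to "NA."
df1 <- read.table(text = "
ID1 speed ID2 Time ID3 Income
4 60 5 100 3 300
3 80 2 90 7 400
2 90 6 100 6 600
", header = TRUE)

df2 <- read.table(text = "
ID Colour CA NA DC NO
2 Y Y N12 A B-12
3 B N M18 B B-17
6 R Y M20 E B-22
4 P N M22 F B-27
7 B Y M11 G B-32
", header = TRUE)

id_cols <- grep("^ID\\d+$", names(df1), value = TRUE)

# 1. long format: one row per (original row, ID column) pair
long <- data.frame(
  rn = rep(seq_len(nrow(df1)), times = length(id_cols)),  # original row number
  k  = rep(seq_along(id_cols), each = nrow(df1)),         # which ID column (1, 2, 3)
  ID = unlist(df1[id_cols], use.names = FALSE)
)

# 2. left join on ID; IDs missing from df2 keep a row filled with NA
joined <- merge(long, df2, by = "ID", all.x = TRUE)
joined <- joined[order(joined$k, joined$rn), ]

# 3. back to wide format: one row per rn, columns suffixed with k
wide <- reshape(joined, idvar = "rn", timevar = "k", direction = "wide", sep = "")
rownames(wide) <- NULL

# 4. put the non-ID columns of df1 back and drop the helper column rn
out <- cbind(df1[setdiff(names(df1), id_cols)], wide[-1])
out
```

Printing long and joined after steps 1 and 2 shows what the pivot_longer and left_join stages of the pipeline produce before the data is spread out again.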


 