Merge R dataframes with at least x columns matching

I have 2 dataframes that I need to match based on at least x columns being the same. df1 has columns A:E; df2 has columns A:Z. Columns A:E are the same in both dfs, but the rows are in a different order.

df1 would look something like:

forename surname   birthdate   code gender  
Joe      Bloggs    23/03/2001  SW3   m
Anne     Anderson  11/11/1999  D37   f
Tom      Smith     31/01/2002  SW4   m
Andy     Clarke    02/06/1999  B37   m

df2 would look like:

forename surname   birthdate   code  gender  eye_colour  dinner_option
Jules    Anderson  09/01/1986  D37    m      blue        meat
Katy     Collins   03/03/2004  NA     f      brown       meat
Andrew   Clarke    02/06/1999  NA     m      brown       veg
Joe      Bloggs    23/03/2001  SW3    m      green       fish

What I need to do is:

  1. compare cols A:E in df1 and df2
  2. find the rows in df2 A:E that match at least 3 columns of df1
  3. for the rows that match 3 or more columns, create df3 with df1[,A:E] and df2[,A:Z]

So the output (df3) would look like the following

forename surname   birthdate   code  gender forename surname   birthdate   
Joe      Bloggs    23/03/2001  SW3    m     Joe      Bloggs    23/03/2001  
Andy     Clarke    02/06/1999  B37    m     Andrew   Clarke    02/06/1999  

code gender  eye_colour  dinner_option
SW3   m      green       fish
NA    m      brown       veg

Joe Bloggs and Andy Clarke are the only rows where at least 3 of the columns match between df1 and df2.*
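
To make the "at least 3 columns" rule concrete, here is a minimal sketch that counts the matching columns for a single pair of rows, using the Andy/Andrew Clarke rows from the sample data. The names `row1`, `row2` and `n_match` are made up for illustration, and treating an NA as a non-match is an assumption about the desired behaviour.

```r
# One row from df1 and its candidate partner from df2 (shared columns only)
row1 <- c(forename = "Andy",   surname = "Clarke", birthdate = "02/06/1999",
          code = "B37", gender = "m")
row2 <- c(forename = "Andrew", surname = "Clarke", birthdate = "02/06/1999",
          code = NA,    gender = "m")

# Element-wise comparison; a comparison involving NA yields NA,
# so na.rm = TRUE makes it count as a non-match
n_match <- sum(row1 == row2, na.rm = TRUE)
n_match        # 3: surname, birthdate and gender agree
n_match >= 3   # TRUE, so this pair qualifies
```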

Any idea about how I could do this in an efficient way?

I've tried the following, but of course, this only identifies matches where ALL the columns are the same, whereas I only need 3 columns to match, not all of them.

colsToUse <- intersect(colnames(df1), colnames(df2))
matching <- match(do.call("paste", df1[, colsToUse]), do.call("paste", df2[, colsToUse]))
matched <- cbind(df1, df2[matching, ])

Thank you for any help!

*I do realise there is some redundant information in df3, but for now I need it to be like that

This is my ugly first attempt.

It works for your sample data, but probably needs some (= a lot of) testing to find weaknesses.

library(data.table)
# !!df1 and df2 need to be data.table, so use fread() or setDT() !!
df1 <- fread("forename surname   birthdate   code gender  
             Joe      Bloggs    23/03/2001  SW3   m
             Anne     Anderson  11/11/1999  D37   f
             Tom      Smith     31/01/2002  SW4   m
             Andy     Clarke    02/06/1999  B37   m")
df2 <- fread("forename surname   birthdate   code  gender  eye_colour  dinner_option
Jules    Anderson  09/01/1986  D37    m      blue        meat
Katy     Collins   03/03/2004  NA     f      brown       meat
Andrew   Clarke    02/06/1999  NA     m      brown       veg
Joe      Bloggs    23/03/2001  SW3    m      green       fish", sep = " ")

# combinations of colnames to join on
col_join <- combn(intersect(names(df1), names(df2)), 3, simplify = FALSE)
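# create df3 with dummy names
# copy() is needed here: a plain `df3 <- df2` only creates a reference,
# so the setnames() call below would rename the columns of df2 as well
df3 <- copy(df2)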
setnames(df3, paste0(names(df2), ".y"))
df3[, id := .I]
# Create expression to evaluate later
joins <- lapply(col_join, function(x) {
  paste0(sapply(x, function(x) {
    paste0(x, " = ", x, ".y")
    }), collapse = ", ")
  })
# update join df1 on all join-combinations (only one match possible per row!!)
# invisible() suppresses printing of the list that lapply() returns
invisible(lapply(joins, function(x) {
  expr <- paste0("df1[df3, id := i.id, on = .(", x, ")]")
  eval(parse(text = expr))
}))
# final join on matched rows
df3[df1[!is.na(id), ], on = .(id)][,id := NULL]
#     forename.y surname.y birthdate.y code.y gender.y eye_colour.y dinner_option.y forename surname  birthdate code gender
# 1:        Joe    Bloggs  23/03/2001    SW3        m        green            fish      Joe  Bloggs 23/03/2001  SW3      m
# 2:     Andrew    Clarke  02/06/1999   <NA>        m        brown             veg     Andy  Clarke 02/06/1999  B37      m
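
For comparison, here is a rough base R sketch of a different technique: build every (df1 row, df2 row) pair, count the shared columns that agree, and keep the pairs that reach the threshold. This is not the join-based method above. The names `min_match`, `pairs`, `hits` and `keep` are my own, NA is assumed to never count as a match, and the sample data is rebuilt as plain data.frames. It creates nrow(df1) * nrow(df2) pairs, so it is probably only practical for small to medium inputs.

```r
# Sample data from the question, as plain data.frames
df1 <- data.frame(
  forename  = c("Joe", "Anne", "Tom", "Andy"),
  surname   = c("Bloggs", "Anderson", "Smith", "Clarke"),
  birthdate = c("23/03/2001", "11/11/1999", "31/01/2002", "02/06/1999"),
  code      = c("SW3", "D37", "SW4", "B37"),
  gender    = c("m", "f", "m", "m"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  forename  = c("Jules", "Katy", "Andrew", "Joe"),
  surname   = c("Anderson", "Collins", "Clarke", "Bloggs"),
  birthdate = c("09/01/1986", "03/03/2004", "02/06/1999", "23/03/2001"),
  code      = c("D37", NA, NA, "SW3"),
  gender    = c("m", "f", "m", "m"),
  eye_colour    = c("blue", "brown", "brown", "green"),
  dinner_option = c("meat", "meat", "veg", "fish"),
  stringsAsFactors = FALSE
)

min_match <- 3                                  # the "x" from the question
cols <- intersect(names(df1), names(df2))       # columns present in both

# Every (df1 row, df2 row) index pair
pairs <- expand.grid(i = seq_len(nrow(df1)), j = seq_len(nrow(df2)))

# One logical column per shared column: TRUE where the paired values agree.
# Comparisons involving NA are turned into FALSE.
hits <- sapply(cols, function(col) {
  same <- df1[[col]][pairs$i] == df2[[col]][pairs$j]
  !is.na(same) & same
})

# Keep the pairs that agree on at least min_match columns
keep <- pairs[rowSums(hits) >= min_match, ]

# df1's shared columns side by side with all of df2's columns
df3 <- cbind(df1[keep$i, cols], df2[keep$j, ])
rownames(df3) <- NULL
df3
```

On the sample data this keeps the Joe Bloggs pair and the Andy/Andrew Clarke pair. Unlike the update join, a df1 row that reaches the threshold against several df2 rows will appear once per matching df2 row, so you may want to de-duplicate afterwards.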
    
