I have 2 dataframes that I need to match based on at least x columns being the same. df1 has columns A:E; df2 has columns A:Z. Columns A:E are the same in both dfs, but the rows are in a different order.
df1 would look something like:
forename surname birthdate code gender
Joe Bloggs 23/03/2001 SW3 m
Anne Anderson 11/11/1999 D37 f
Tom Smith 31/01/2002 SW4 m
Andy Clarke 02/06/1999 B37 m
df2 would look like:
forename surname birthdate code gender eye_colour dinner_option
Jules Anderson 09/01/1986 D37 m blue meat
Katy Collins 03/03/2004 NA f brown meat
Andrew Clarke 02/06/1999 NA m brown veg
Joe Bloggs 23/03/2001 SW3 m green fish
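What I need to do is find the rows where at least 3 of the shared columns match between df1 and df2, and bind each matching pair of rows side by side.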
So the output (df3) would look like the following
forename surname birthdate code gender forename surname birthdate code gender eye_colour dinner_option
Joe Bloggs 23/03/2001 SW3 m Joe Bloggs 23/03/2001 SW3 m green fish
Andy Clarke 02/06/1999 B37 m Andrew Clarke 02/06/1999 NA m brown veg
Joe Bloggs and Andy Clarke are the only rows where at least 3 of the columns match between df1 and df2.*
Any idea about how I could do this in an efficient way?
I've tried the following, but of course, this only identifies matches where ALL the columns are the same, whereas I only need 3 columns to match, not all of them.
colsToUse <- intersect(colnames(df1), colnames(df2))
matching <- match(do.call("paste", df1[, colsToUse]), do.call("paste", df2[, colsToUse]))
matched <- cbind(df1, df2[matching, ])
Thank you for any help!
*I do realise there is some redundant information in df3, but for now I need it to be like that
This is my ugly first attempt.
It works for your sample data, but probably needs some (= a lot of) testing to find weaknesses.
library(data.table)
# !!df1 and df2 need to be data.table, so use fread() or setDT() !!
df1 <- fread("forename surname birthdate code gender
Joe Bloggs 23/03/2001 SW3 m
Anne Anderson 11/11/1999 D37 f
Tom Smith 31/01/2002 SW4 m
Andy Clarke 02/06/1999 B37 m")
df2 <- fread("forename surname birthdate code gender eye_colour dinner_option
Jules Anderson 09/01/1986 D37 m blue meat
Katy Collins 03/03/2004 NA f brown meat
Andrew Clarke 02/06/1999 NA m brown veg
Joe Bloggs 23/03/2001 SW3 m green fish", sep = " ")
# combinations of colnames to join on
col_join <- combn(intersect(names(df1), names(df2)), 3, simplify = FALSE)
# create df3 with dummy names
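# copy() is needed: a plain assignment would make setnames() and := below also modify df2 by reference
df3 <- copy(df2)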
setnames(df3, paste0(names(df2), ".y"))
df3[, id := .I]
# Create expression to evaluate later
joins <- lapply(col_join, function(x) {
  paste0(sapply(x, function(col) {
    paste0(col, " = ", col, ".y")
  }), collapse = ", ")
})
# update join df1 on all join-combinations (only one match possible per row!!)
lapply(joins, function(x) {
  expr <- paste0("df1[df3, id := i.id, on = .(", x, ")]")
  eval(parse(text = expr))
})
# final join on matched rows
df3[df1[!is.na(id), ], on = .(id)][,id := NULL]
# forename.y surname.y birthdate.y code.y gender.y eye_colour.y dinner_option.y forename surname birthdate code gender
# 1: Joe Bloggs 23/03/2001 SW3 m green fish Joe Bloggs 23/03/2001 SW3 m
# 2: Andrew Clarke 02/06/1999 <NA> m brown veg Andy Clarke 02/06/1999 B37 m
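As a cross-check on the join-based approach above, here is a minimal base R sketch that compares every row of df1 with every row of df2 and counts how many shared columns agree. It is only an illustration under some assumptions: the data are small enough that nrow(df1) * nrow(df2) pairs fit in memory, an NA never counts as a match, and the names min_matches, shared, pairs, n_equal, keep and matched are made up for this example.

```r
# Sample data from the question, rebuilt as plain data.frames
df1 <- data.frame(
  forename  = c("Joe", "Anne", "Tom", "Andy"),
  surname   = c("Bloggs", "Anderson", "Smith", "Clarke"),
  birthdate = c("23/03/2001", "11/11/1999", "31/01/2002", "02/06/1999"),
  code      = c("SW3", "D37", "SW4", "B37"),
  gender    = c("m", "f", "m", "m"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  forename      = c("Jules", "Katy", "Andrew", "Joe"),
  surname       = c("Anderson", "Collins", "Clarke", "Bloggs"),
  birthdate     = c("09/01/1986", "03/03/2004", "02/06/1999", "23/03/2001"),
  code          = c("D37", NA, NA, "SW3"),
  gender        = c("m", "f", "m", "m"),
  eye_colour    = c("blue", "brown", "brown", "green"),
  dinner_option = c("meat", "meat", "veg", "fish"),
  stringsAsFactors = FALSE
)

min_matches <- 3  # hypothetical name for the "at least x columns" threshold
shared <- intersect(names(df1), names(df2))

# Every pair of rows: i indexes df1, j indexes df2
pairs <- expand.grid(i = seq_len(nrow(df1)), j = seq_len(nrow(df2)))

# One logical column per shared column: TRUE where the pair agrees
eq <- do.call(cbind, lapply(shared, function(col) {
  df1[[col]][pairs$i] == df2[[col]][pairs$j]
}))

# Count agreeing columns per pair; na.rm = TRUE means NA is not a match
n_equal <- rowSums(eq, na.rm = TRUE)

# Keep pairs that reach the threshold, in df1 row order
keep <- pairs[n_equal >= min_matches, ]
keep <- keep[order(keep$i), ]

# Side-by-side result; cbind() keeps the duplicated column names
matched <- cbind(df1[keep$i, ], df2[keep$j, ])
rownames(matched) <- NULL
print(matched)
```

Unlike the update join, this sketch does not limit each df1 row to one match: a df1 row that reaches the threshold against several df2 rows appears once per matching pair. The cost grows with the product of the two row counts, so for large tables the combination-of-joins approach is likely the better fit.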