简体   繁体   中英

R: Combining Multiple Fuzzy Joins

Using the R programming language, I have the following two tables (in my actual problem, all dates are given to me in "factor" types):

table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))

table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id1 = c("0123", "1233", "125  .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)

I am trying to perform an "inner join" if either Condition 1 OR Condition 2 is true:

Condition_1

  • if table_1$id "fuzzy equal" table_2$id AND

  • if table_1$date BETWEEN(table_2$date_2,table_2$date_3)

Condition_2

  • if table_1$id2 "fuzzy equal" table_2$id2

Right now, I know how to do this in 2 part:

library(dplyr)
library(fuzzyjoin)

part_1 =  stringdist_inner_join(table_1, table_2, by = "id1", max_dist = 2) %>%
  filter(date_1 >= date_2, date_1 <= date_3) 

part_2 =  stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2) 

combine = rbind(part_1, part_2)

final = combine[!duplicated(combine[c(1,2,3,4,5,6,7)]),]

My Question

  • Is there a "better" way to run this join together, instead of in two separate parts?

  • It seems that the SQL query in "part_1" is first performing a fuzzy join on all records, and then only keeping the relevant records that satisfy the date criteria, ie filter(date_1 >= date_2, date_1 <= date_3) . This seems to be an inefficient way of doing things - or is this the only way possible to accomplish this task, since a fuzzy join must be run on all rows by default to see if "id" condition is satisfied, and only then the "date" condition can be satisfied?

Thanks!

If we want to do this in a loop, loop over the variable part ie the by

library(purrr)
library(fuzzyjoin)
library(dplyr)
final2 <- map_dfr(c("id1", "id2"),  ~
       stringdist_inner_join(table_1, table_2, by = .x, max_dist = 2)) %>%
       distinct %>%
       arrange(across(everything()))

-checking

> all.equal(final %>% 
         arrange(across(everything())), final2)
[1] TRUE

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM