Using the R programming language, I have the following two tables (in my actual problem, all dates are given to me in "factor" types):
table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))
table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)
table_2 = data.frame(id1 = c("0123", "1233", "125 .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))
table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)
I am trying to perform an "inner join" if either Condition 1 OR Condition 2 is true:
Condition_1
if table_1$id "fuzzy equal" table_2$id AND
if table_1$date BETWEEN(table_2$date_2,table_2$date_3)
Condition_2
Right now, I know how to do this in 2 part:
library(dplyr)
library(fuzzyjoin)
part_1 = stringdist_inner_join(table_1, table_2, by = "id1", max_dist = 2) %>%
filter(date_1 >= date_2, date_1 <= date_3)
part_2 = stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2)
combine = rbind(part_1, part_2)
final = combine[!duplicated(combine[c(1,2,3,4,5,6,7)]),]
My Question
Is there a "better" way to run this join together, instead of in two separate parts?
It seems that the SQL query in "part_1" is first performing a fuzzy join on all records, and then only keeping the relevant records that satisfy the date criteria, ie filter(date_1 >= date_2, date_1 <= date_3)
. This seems to be an inefficient way of doing things - or is this the only way possible to accomplish this task, since a fuzzy join must be run on all rows by default to see if "id" condition is satisfied, and only then the "date" condition can be satisfied?
Thanks!
If we want to do this in a loop, loop over the variable part ie the by
library(purrr)
library(fuzzyjoin)
library(dplyr)
final2 <- map_dfr(c("id1", "id2"), ~
stringdist_inner_join(table_1, table_2, by = .x, max_dist = 2)) %>%
distinct %>%
arrange(across(everything()))
-checking
> all.equal(final %>%
arrange(across(everything())), final2)
[1] TRUE
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.