joining tables with multiple common and dissimilar IDs DBplyr R

Question

I have a question that I have answer two but think it is inefficient and so want a better way. I am querying a database using DBplyr and joining tables based upon a few ID columns.

For example, I have one table (Table_1) in the database with an org_ID and a sender_ID, the first is the number for the organization in a given geography and the second a number for organizations with which they interacted in a given geography. So org_ID = 1 in geography 2, org_ID 1 in geography 3, and so forth.

I also have a single identification table (Table_2) with an org_ID column and a corresponding geography. There is not a sender_ID in Table_2, but the sender_ID in Table_2 does match org_ID/geography pairs in Table_1.

I want to combine these tables, but I need to use the Table_1 twice, effectively. What works, but is slow is the following

df <- Table_1 %>%
  left_join(Table_2, by=c('org_ID' = "org_ID" , 'geography' = 'geography' ))%>%
  left_join(Table_2, by=c('sender_ID' = "org_ID" , 'geography' = 'geography')) %>%
#various grouping and summary commands %>%
collect()

Any ideas for better ways?

Answer 1

If it is mainly a question of how to do:

df <- Table_1 %>%
  left_join(Table_2, by=c('org_ID' = "org_ID" , 'geography' = 'geography' ))%>%
  left_join(Table_2, by=c('sender_ID' = "org_ID" , 'geography' = 'geography')) %>%
  #various grouping and summary commands %>%
  collect()

more effectively (with only a single join), then I would suggest:

df <- Table_1 %>%
  left_join(Table_2, by = "geography", suffix = c(".x",".y") %>%
  filter(org_ID.x == org_ID.y | sender_ID.x == org_ID.y) %>%
  #various grouping and summary commands %>%
  collect()

However, imposes certain handling of records in Table_1 that do not appear in Table_2 : It behaves more like an inner join than a left join.

Regarding peformance:

Joins are faster when the underlying tables are indexed. Make sure the database has indexes on the columns you are joining on.
Even efficient commands/SQL will have minimal impact on speed if the connection between R and SQL is slow. collect() copies that data from SQL into R, this may be the slowest point of your code.

joining tables with multiple common and dissimilar IDs DBplyr R

Question

1 answers

solution1
0 2022-09-11 20:46:48

joining tables with multiple common and dissimilar IDs DBplyr R

Question

1 answers

solution1 0 2022-09-11 20:46:48

solution1
0 2022-09-11 20:46:48