Joining tables with multiple common and dissimilar IDs in dbplyr (R)

I have a question that I already have an answer to, but I think my solution is inefficient and would like a better way. I am querying a database using dbplyr and joining tables on a few ID columns.

For example, I have one table (Table_1) in the database with an org_ID and a sender_ID: the first is the number of the organization in a given geography, and the second is the number of an organization with which it interacted in that geography. The same org_ID can appear in several geographies: org_ID 1 in geography 2, org_ID 1 in geography 3, and so forth.

I also have a single identification table (Table_2) with an org_ID column and a corresponding geography. There is no sender_ID in Table_2, but the sender_ID in Table_1 does match org_ID/geography pairs in Table_2.

I want to combine these tables, but I effectively need to use Table_2 twice. What works, but is slow, is the following:

df <- Table_1 %>%
  left_join(Table_2, by = c("org_ID" = "org_ID", "geography" = "geography")) %>%
  left_join(Table_2, by = c("sender_ID" = "org_ID", "geography" = "geography")) %>%
  # various grouping and summary commands %>%
  collect()

Any ideas for better ways?

If it is mainly a question of how to do:

df <- Table_1 %>%
  left_join(Table_2, by=c('org_ID' = "org_ID" , 'geography' = 'geography' ))%>%
  left_join(Table_2, by=c('sender_ID' = "org_ID" , 'geography' = 'geography')) %>%
  #various grouping and summary commands %>%
  collect()

more efficiently (with only a single join), then I would suggest:

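df <- Table_1 %>%
  left_join(Table_2, by = "geography", suffix = c(".x", ".y")) %>%
  # only org_ID exists in both tables, so only it receives a suffix;
  # sender_ID comes from Table_1 alone and keeps its name
  filter(org_ID.x == org_ID.y | sender_ID == org_ID.y) %>%
  # various grouping and summary commands %>%
  collect()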

However, this changes how records in Table_1 that have no match in Table_2 are handled: because the filter drops rows where org_ID.y is NULL, it behaves more like an inner join than a left join.
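To make that difference concrete, here is a minimal sketch on toy data that compares the two approaches. It uses local tibbles rather than a database connection, and the `name` column, the `_org`/`_sender` suffixes and all values are assumptions made up for illustration.

```r
library(dplyr)

# Toy stand-ins for the database tables; the `name` column is hypothetical.
Table_1 <- tibble(org_ID = c(1, 7), sender_ID = c(2, 8), geography = c(10, 20))
Table_2 <- tibble(org_ID = c(1, 2), geography = c(10, 10), name = c("A", "B"))

# Two left joins: one row per Table_1 row, unmatched rows are kept with NA.
two_joins <- Table_1 %>%
  left_join(Table_2, by = c("org_ID" = "org_ID", "geography" = "geography")) %>%
  left_join(Table_2, by = c("sender_ID" = "org_ID", "geography" = "geography"),
            suffix = c("_org", "_sender"))

# Single join plus filter: one row per match, unmatched rows are dropped.
single_join <- Table_1 %>%
  left_join(Table_2, by = "geography", suffix = c(".x", ".y")) %>%
  filter(org_ID.x == org_ID.y | sender_ID == org_ID.y)

print(two_joins)
print(single_join)
```

In this toy case the Table_1 row for geography 20 survives the two-join version but disappears from the single-join version. The single-join result also has a different shape: the organization and sender matches arrive as separate rows rather than as side-by-side columns, so later grouping and summary steps may need adjusting.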

Regarding performance:

  • Joins are faster when the underlying tables are indexed. Make sure the database has indexes on the columns you are joining on.
  • Even efficient commands/SQL will have minimal impact on speed if the connection between R and the database is slow. collect() copies the data from the database into R, and this may be the slowest point of your code.
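Both points can be tried out in a rough sketch like the one below. It assumes an in-memory SQLite database (via the DBI and RSQLite packages) as a stand-in for the real one, and the table contents and the `name` column are made up. On an existing production table, an index would normally be created in the database itself rather than through copy_to().

```r
library(dplyr)
library(dbplyr)

# In-memory SQLite database standing in for the real one (assumption).
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy data.
table_1_local <- data.frame(org_ID = c(1, 3), sender_ID = c(2, 1),
                            geography = c(10, 10))
table_2_local <- data.frame(org_ID = c(1, 2, 3), geography = c(10, 10, 10),
                            name = c("A", "B", "C"))

# copy_to() can create an index on the join columns while uploading.
Table_1 <- copy_to(con, table_1_local, "Table_1", temporary = FALSE)
Table_2 <- copy_to(con, table_2_local, "Table_2", temporary = FALSE,
                   indexes = list(c("org_ID", "geography")))

# Build the query lazily; nothing is transferred to R yet.
query <- Table_1 %>%
  left_join(Table_2, by = c("org_ID" = "org_ID", "geography" = "geography")) %>%
  left_join(Table_2, by = c("sender_ID" = "org_ID", "geography" = "geography"),
            suffix = c("_org", "_sender")) %>%
  count(geography)

# Inspect the SQL that dbplyr generates before running it.
show_query(query)

# Only the summarised result crosses the connection.
result <- collect(query)

DBI::dbDisconnect(con)
```

Keeping the grouping and summary steps before collect() means the database does the aggregation and only the small summarised table is copied into R.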
