
R - dplyr left_join() - Multiple Matches - How to Recombine…?

If I have the following:

x <- data.frame(
       Row_Index = 1:5,
       Name = c("Alan", "Bob", "Charles", "David", "Eric"),
       Age = c(49, 23, 44, 52, 18),
       City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)

y <- data.frame(
       Claim_Reference = 1:6,
       Row_Index = c(3, 2, 2, 4, 6, 4),
       Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

z <- x %>%
     left_join(y, by = "Row_Index") %>%
     group_by(Row_Index, Name, Age) %>%
     summarise(Total_Claim_Amount = sum(Claim_Amount))

it produces a nice joined table where for each individual in x I can see their Name, Age and Total_Claim_Amount. All OK.

It would be sufficient for grouping purposes to use Row_Index alone in the group_by() statement and skip Name and Age, but then they won't appear in the resulting table, which isn't what I want.

In a real-life example, I'm doing exactly the same type of lookup, but with many more fields: my left-join query has 55 variables inside the group_by() statement and 16 variables inside the summarise() statement. It's overwhelming my PC.

Is there a more efficient way to do this? It's something I need to do quite often. Should I, for example, move the "redundant" variables in the group_by() statement into the summarise statement, preceded by a first() or something like that?
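For reference, the first()-based variant described above might look like this, a sketch using the toy x and y from the question (it still joins before summarising, so it doesn't fix the underlying cost):

```r
library(dplyr)

x <- data.frame(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)

y <- data.frame(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Group only by Row_Index; pull the descriptive columns through with first()
# instead of listing them all in group_by().
z <- x %>%
  left_join(y, by = "Row_Index") %>%
  group_by(Row_Index) %>%
  summarise(
    Name = first(Name),
    Age = first(Age),
    Total_Claim_Amount = sum(Claim_Amount)
  )
```

Individuals with no claims (Alan, Eric) get NA for Total_Claim_Amount, the same as in the original query.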

Thank you.

z <- y %>% 
  group_by(Row_Index) %>%
  summarize(...) %>% 
  right_join(x, by = "Row_Index")
# same result, much more efficient.

In your example, the join adds a bunch of columns to y, 55 of them, with lots of repeated information. Grouping by and summarising all those columns means R has to scan every single column to check there are no mismatches within a Row_Index that would require creating a new group. You already know that each Row_Index defines a group, so you should tell R to group only by Row_Index, do your summarise, and then do the join to add the contextual information for each Row_Index. This gets dramatically faster as the number of columns grows.
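With the toy data from the question, the group-then-join pattern fills in like this (the summarise body is sum(Claim_Amount), taken from the question; only the column order differs from the original result):

```r
library(dplyr)

x <- data.frame(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)

y <- data.frame(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Summarise first (6 claim rows collapse to 4 groups), then join the
# contextual columns from x back on. right_join keeps every row of x,
# so Row_Index 6 (present only in y) is dropped, just as with the
# original left_join from x.
z <- y %>%
  group_by(Row_Index) %>%
  summarize(Total_Claim_Amount = sum(Claim_Amount)) %>%
  right_join(x, by = "Row_Index")
```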

If you want additional speed, you could switch to data.table, but my guess is this will adequately solve your speed problem.
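For completeness, a rough data.table translation of the same summarise-then-join idea might look like this (this is my own sketch of the translation, not code from the answer):

```r
library(data.table)

xt <- data.table(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)

yt <- data.table(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Aggregate the claims by Row_Index first...
totals <- yt[, .(Total_Claim_Amount = sum(Claim_Amount)), by = Row_Index]

# ...then join onto xt, keeping every row of xt (a right join onto x).
z <- totals[xt, on = "Row_Index"]
```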
