

R - dplyr left_join() - Multiple Matches - How to Recombine…?

If I have the following:

x <- data.frame(
       Row_Index = 1:5,
       Name = c("Alan", "Bob", "Charles", "David", "Eric"),
       Age = c(49, 23, 44, 52, 18),
       City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)

y <- data.frame(
       Claim_Reference = 1:6,
       Row_Index = c(3, 2, 2, 4, 6, 4),
       Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

z <- x %>%
       left_join(y, by = "Row_Index") %>%
       group_by(Row_Index, Name, Age) %>%
       summarise(Total_Claim_Amount = sum(Claim_Amount))

it produces a nice joined table where, for each individual in x, I can see their Name, Age, and Total_Claim_Amount. All ok.

It would be sufficient for grouping purposes to use Row_Index alone in the group_by() statement and skip Name and Age, but then those columns won't appear in the resulting table, which isn't what I want.

In a real-life example, I'm doing exactly the same type of lookup, but with many more fields: my left-join query has 55 variables inside the group_by() statement and 16 variables inside the summarise() statement. It's overwhelming my PC.

Is there a more efficient way to do this? It's something I need to do quite often. Should I, for example, move the "redundant" variables out of the group_by() statement and into the summarise() statement, wrapped in first() or something like that?
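For reference, the first()-in-summarise idea mentioned above can be sketched as follows, reusing the x and y frames defined earlier (z_first is a hypothetical name for this variant):

```r
library(dplyr)

# x and y as defined in the question
x <- data.frame(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)
y <- data.frame(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Group only by the key; carry the other columns through with first().
z_first <- x %>%
  left_join(y, by = "Row_Index") %>%
  group_by(Row_Index) %>%
  summarise(
    Name = first(Name),
    Age  = first(Age),
    Total_Claim_Amount = sum(Claim_Amount)
  )
```

This avoids grouping on every column, though first() is still evaluated once per carried column per group.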

Thank you.

z <- y %>% 
  group_by(Row_Index) %>%
  summarize(...) %>% 
  right_join(x, by = "Row_Index")
# same result, much more efficiently.

In your example, the join adds a bunch of columns to y — 55 columns, with lots of repeated information. Grouping by and summarizing all those columns means R has to go through every single column and make sure there aren't any mismatches with Row_Index that would require the creation of a new group. You know that each Row_Index defines a group, so you should tell R to group only by Row_Index, do your summarize, and then do the join to add the contextual information for each Row_Index. This should be dramatically faster as the number of columns grows.
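A minimal, self-contained version of that approach (reusing the x and y frames from the question, with the summarize filled in with the question's aggregation) might look like:

```r
library(dplyr)

# x and y as defined in the question
x <- data.frame(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)
y <- data.frame(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Summarise the claims table first (one row per Row_Index),
# then join the small result back onto x.
z <- y %>%
  group_by(Row_Index) %>%
  summarise(Total_Claim_Amount = sum(Claim_Amount)) %>%
  right_join(x, by = "Row_Index")
```

Individuals with no claims (Row_Index 1 and 5 here) come back with NA, matching the join-then-summarise result; wrap the total in coalesce(Total_Claim_Amount, 0) afterwards if you'd rather have zeros.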

If you want additional speed, you could switch to data.table, but my guess is this will adequately solve your speed problem.
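If you do go that route, a data.table sketch of the same aggregate-then-join pattern (assuming the same x and y as above) could look like this:

```r
library(data.table)

# x and y as defined in the question, converted to data.tables
xt <- data.table(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)
yt <- data.table(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Aggregate the claims by key first...
totals <- yt[, .(Total_Claim_Amount = sum(Claim_Amount)), by = Row_Index]

# ...then join onto xt, keeping every row of xt.
zt <- totals[xt, on = "Row_Index"]
```

The `totals[xt]` form keeps all rows of xt, so people without claims appear with NA, just as in the dplyr version.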
