

R - dplyr left_join() - Multiple Matches - How to Recombine…?

If I have the following:

x <- data.frame(
       Row_Index = 1:5,
       Name = c("Alan", "Bob", "Charles", "David", "Eric"),
       Age = c(49, 23, 44, 52, 18),
       City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)

y <- data.frame(
       Claim_Reference = 1:6,
       Row_Index = c(3, 2, 2, 4, 6, 4),
       Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

z <- x %>%
       left_join(y, by = "Row_Index") %>%
       group_by(Row_Index, Name, Age) %>%
       summarise(Total_Claim_Amount = sum(Claim_Amount))

it produces a nice joined table where, for each individual in x, I can see their Name, Age, and Total_Claim_Amount. All ok.

It would be sufficient for grouping purposes to use Row_Index alone in the group_by() statement and skip Name and Age, but then those columns won't appear in the resulting table, which isn't what I want.

In a real-life example, I'm doing exactly the same type of lookup, but with many more fields: my left-join query has 55 variables inside the group_by() statement and 16 variables inside the summarise() statement. It's overwhelming my PC.

Is there a more efficient way to do this? It's something I need to do quite often. Should I, for example, move the "redundant" variables out of the group_by() statement and into the summarise() statement, wrapped in first() or something like that?
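For reference, the first()-in-summarise idea mentioned above can be sketched as follows, reusing the x and y frames defined earlier (z_first is a hypothetical name for this variant):

```r
library(dplyr)

# x and y as defined in the question
x <- data.frame(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)
y <- data.frame(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Group only by the key; carry the other columns through with first().
z_first <- x %>%
  left_join(y, by = "Row_Index") %>%
  group_by(Row_Index) %>%
  summarise(
    Name = first(Name),
    Age  = first(Age),
    Total_Claim_Amount = sum(Claim_Amount)
  )
```

This avoids grouping on every column, though first() is still evaluated once per carried column per group.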

Thank you.

z <- y %>% 
  group_by(Row_Index) %>%
  summarize(...) %>% 
  right_join(x, by = "Row_Index")
# same result, much more efficiently.

In your example, the join adds a bunch of columns to y — 55 columns, with lots of repeated information. Grouping by and summarizing all those columns means R has to go through every single column and make sure there aren't any mismatches with Row_Index that would require the creation of a new group. You know that each Row_Index defines a group, so you should tell R to group only by Row_Index, do your summarize, and then do the join to add the contextual information for each Row_Index. This should be dramatically faster as the number of columns grows.
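A minimal, self-contained version of that approach (reusing the x and y frames from the question, with the summarize filled in with the question's aggregation) might look like:

```r
library(dplyr)

# x and y as defined in the question
x <- data.frame(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)
y <- data.frame(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Summarise the claims table first (one row per Row_Index),
# then join the small result back onto x.
z <- y %>%
  group_by(Row_Index) %>%
  summarise(Total_Claim_Amount = sum(Claim_Amount)) %>%
  right_join(x, by = "Row_Index")
```

Individuals with no claims (Row_Index 1 and 5 here) come back with NA, matching the join-then-summarise result; wrap the total in coalesce(Total_Claim_Amount, 0) afterwards if you'd rather have zeros.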

If you want additional speed, you could switch to data.table, but my guess is this will adequately solve your speed problem.
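If you do go that route, a data.table sketch of the same aggregate-then-join pattern (assuming the same x and y as above) could look like this:

```r
library(data.table)

# x and y as defined in the question, converted to data.tables
xt <- data.table(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)
yt <- data.table(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Aggregate the claims by key first...
totals <- yt[, .(Total_Claim_Amount = sum(Claim_Amount)), by = Row_Index]

# ...then join onto xt, keeping every row of xt.
zt <- totals[xt, on = "Row_Index"]
```

The `totals[xt]` form keeps all rows of xt, so people without claims appear with NA, just as in the dplyr version.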
