R - dplyr left_join() - Multiple Matches - How to Recombine…?
If I have the following:
library(dplyr)

x <- data.frame(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)
y <- data.frame(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)
z <- x %>% left_join(y, by = c("Row_Index" = "Row_Index")) %>%
  group_by(Row_Index, Name, Age) %>%
  summarise(Total_Claim_Amount = sum(Claim_Amount))
it produces a nice joined table where, for each individual in x, I can see their Name, Age and Total_Claim_Amount. All ok.
It would be sufficient for grouping purposes to use Row_Index alone in the group_by() statement and skip Name and Age, but then they won't appear in the resulting table, which isn't what I want.
In a real-life example, I'm doing exactly the same type of lookup, but with many more fields. My left join query has 55 variables inside the group_by() statement and 16 variables inside the summarise() statement. It's overwhelming my PC.
Is there a more efficient way to do this? It's something I need to do quite often. Should I, for example, move the "redundant" variables in the group_by() statement into the summarise() statement, preceded by a first() or something like that?
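To make the first() idea concrete, here is a minimal sketch on the toy data above. It groups by Row_Index only and carries Name and Age through with first(); the name z_first is just an illustrative choice, and this is only one way the idea could be written.

```r
library(dplyr)

x <- data.frame(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)
y <- data.frame(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Group on the key only; the descriptive columns are constant within
# each Row_Index, so first() simply picks that one value per group.
z_first <- x %>%
  left_join(y, by = "Row_Index") %>%
  group_by(Row_Index) %>%
  summarise(
    Name = first(Name),
    Age = first(Age),
    Total_Claim_Amount = sum(Claim_Amount)
  )
```

This still joins first and summarises afterwards, so every descriptive column is repeated once per matching claim before being collapsed again.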
Thank you.
z <- y %>%
  group_by(Row_Index) %>%
  summarize(...) %>%
  right_join(x, by = "Row_Index")
# same result, much more efficiently.
In your example, you add a bunch of columns to y with the join, 55 columns, with lots of repeated information. Grouping by all those columns means R has to go through every single one of them and make sure there aren't any mismatches within a Row_Index that would require the creation of a new group. You know that each Row_Index defines a group, so you should tell R to group only by Row_Index, do your summarize, and then do the join to add contextual information for each Row_Index. This should be considerably faster, and the gap widens as the number of grouping columns grows.
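Applied to the toy data from the question, the summarise-then-join order might look like the sketch below. The intermediate name y_totals is only illustrative, and the join is written as a left_join() starting from x rather than a right_join() onto the summary, which should match the same rows.

```r
library(dplyr)

x <- data.frame(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)
y <- data.frame(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Step 1: collapse y to one row per Row_Index, grouping on the key only.
y_totals <- y %>%
  group_by(Row_Index) %>%
  summarise(Total_Claim_Amount = sum(Claim_Amount))

# Step 2: attach the totals to x. Every column of x is kept, and people
# with no claims (Alan, Eric) get NA for Total_Claim_Amount.
z <- x %>%
  left_join(y_totals, by = "Row_Index")
```

Because the join runs after the aggregation, the descriptive columns of x are never duplicated across claims and never need to appear in group_by().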
If you want additional speed, you could switch to data.table, but my guess is this will adequately solve your speed problem.
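For reference, a rough data.table version of the same aggregate-then-join pattern could look like this. It is a sketch on the toy data only, not a benchmarked recommendation; the names x_dt, y_dt, y_totals and z_dt are illustrative, and Row_Index in y_dt is written as integer here (an assumption, not in the original data) so that both key columns have the same type.

```r
library(data.table)

x_dt <- data.table(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)
y_dt <- data.table(
  Claim_Reference = 1:6,
  Row_Index = c(3L, 2L, 2L, 4L, 6L, 4L),  # integer, to match x_dt$Row_Index
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Aggregate claims to one row per Row_Index.
y_totals <- y_dt[, .(Total_Claim_Amount = sum(Claim_Amount)), by = Row_Index]

# Left join the totals onto x_dt; all.x = TRUE keeps people with no claims.
z_dt <- merge(x_dt, y_totals, by = "Row_Index", all.x = TRUE)
```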