
R: Aggregating Large Data Frame under a Grouping Condition

I'm trying to figure out the fastest way to aggregate a large data frame (about 50M rows) that looks similar to:

> sample_frame = data.frame("id" = rep(sample(1:100,2,replace=F),3),
+ "date" = sample(seq(as.Date("2014-01-01"),as.Date("2014-02-13"),by=1),6),
+ "value" = runif(6))
> sample_frame
  id       date      value
1 73 2014-02-11 0.84197491
2  7 2014-01-14 0.08057893
3 73 2014-01-16 0.78521616
4  7 2014-01-24 0.61889286
5 73 2014-02-06 0.54792356
6  7 2014-01-06 0.66484848

Here we have 2 unique IDs, each with 3 dates and a value assigned to each date. I know that I can use ddply, data.table, or just lapply to aggregate and find the mean for each ID.

What I'm really looking for is a way to quickly find, for each ID, the mean over its two most recent dates. For example, with sapply:

> sapply(split(sample_frame,sample_frame$id),function(x){
+   mean(x$value[x$date%in%x$date[order(x$date,decreasing=T)][1:2]])
+ })
        7        73 
0.3497359 0.6949492

I can't figure out how to get data.table to do this. Thoughts? Hints?

Why not use tail in your "data.table" aggregation step? Keying the table by id and date sorts it, so the last two rows of each group are the two most recent dates.

library(data.table)

set.seed(1)
sample_frame = data.frame("id" = rep(sample(1:100,2,replace=F),3),
                          "date" = sample(seq(as.Date("2014-01-01"),
                                              as.Date("2014-02-13"),by=1),6),
                          "value" = runif(6))

# The key sorts rows by id, then by date; tail() below relies on that order
DT <- data.table(sample_frame, key = "id,date")
DT
#    id       date      value
# 1: 27 2014-01-09 0.20597457
# 2: 27 2014-01-26 0.62911404
# 3: 27 2014-02-07 0.68702285
# 4: 37 2014-02-06 0.17655675
# 5: 37 2014-02-09 0.06178627
# 6: 37 2014-02-13 0.38410372
DT[, mean(tail(value, 2)), by = id]
#    id        V1
# 1: 27 0.6580684
# 2: 37 0.2229450
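
If the table is not keyed, the same idea should still work as long as the rows are ordered by date before grouping. The following is a minimal sketch under that assumption; `DT2`, `res`, and `mean_last2` are hypothetical names, and the input reuses the values printed above in a deliberately shuffled order.

```r
library(data.table)

# Hypothetical unsorted input, reusing the values printed above
DT2 <- data.table(
  id    = c(37, 27, 37, 27, 37, 27),
  date  = as.Date(c("2014-02-13", "2014-01-09", "2014-02-06",
                    "2014-02-07", "2014-02-09", "2014-01-26")),
  value = c(0.38410372, 0.20597457, 0.17655675,
            0.68702285, 0.06178627, 0.62911404)
)

# Order rows by date in i, then average the last two values within each id
res <- DT2[order(date), .(mean_last2 = mean(tail(value, 2))), by = id]

# Sort the result by id so the row order is predictable
setorder(res, id)
res
```

Ordering in `i` sorts on every call, so for repeated queries on a large table setting the key once is likely the cheaper choice.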

Since you need the mean of just two values, you can compute it directly (without calling mean). You can also use the internal variable .N instead of tail for a further speed-up. You just have to handle the case where an id has only one date. This should be much faster.

DT[, (value[.N]+value[max(1L, .N-1)])/2, by=id]
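
As a quick sanity check, the `.N` form can be compared against the `tail` form on a small table that includes an id with a single date. This is only a sketch on made-up data; `DT3`, `by_tail`, `by_N`, and the column name `m` are hypothetical.

```r
library(data.table)

# Hypothetical data: id 1 has three dates, id 2 has only one
DT3 <- data.table(
  id    = c(1, 1, 1, 2),
  date  = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-01")),
  value = c(1, 2, 4, 10),
  key   = c("id", "date")
)

# Mean of the last two values per id via tail()
by_tail <- DT3[, .(m = mean(tail(value, 2))), by = id]

# Same quantity via .N; max(1L, .N - 1L) falls back to the only row
# when a group has a single date, so that value is averaged with itself
by_N <- DT3[, .(m = (value[.N] + value[max(1L, .N - 1L)]) / 2), by = id]

# Both approaches should agree on every group
all.equal(by_tail$m, by_N$m)
```

For the single-date group both forms return the lone value itself, which is the edge case the `max(1L, ...)` guard is there for.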

