R data.table cumulative sum over time intervals

I have a table with values that exist during specific time intervals. I want a field that, for each row, sums all values for the same ID whose intervals are active at that row's start time.

Here is a pared-down example:

library(data.table)

x <- data.table(ID = c(rep(1, 5), rep(2, 5)),
                DT_START = as.Date(c('2017-05-28', '2017-05-29', '2017-07-03', '2018-05-28', '2018-05-29',
                                     '2019-07-03', '2019-10-08', '2020-05-28', '2020-05-29', '2020-07-03')),
                DT_END = as.Date(c('2018-05-28', '2018-05-29', '2018-07-03', '2018-05-29', '2019-05-28',
                                   '2019-10-08', '2020-07-03', '2021-05-28', '2021-05-29', '2020-10-03')),
                VALUE = c(300, 400, 200, 100, 150, 250, 350, 50, 10, 45))

The table looks like this:

x
    ID   DT_START     DT_END VALUE
 1:  1 2017-05-28 2018-05-28   300
 2:  1 2017-05-29 2018-05-29   400
 3:  1 2017-07-03 2018-07-03   200
 4:  1 2018-05-28 2018-05-29   100
 5:  1 2018-05-29 2019-05-28   150
 6:  2 2019-07-03 2019-10-08   250
 7:  2 2019-10-08 2020-07-03   350
 8:  2 2020-05-28 2021-05-28    50
 9:  2 2020-05-29 2021-05-29    10
10:  2 2020-07-03 2020-10-03    45

The first row has the first start date for that ID, and no other interval for that ID is active on that date, so the cumulative value is just 300. By the second row, we add 300 + 400 to get 700, because as of 2017-05-29 both the 400 and the 300 were active for ID 1. An interval counts as active on a date when DT_START <= date < DT_END, which is why row 4 (starting 2018-05-28) no longer includes the 300 that ends on that day. The full desired output is obtained using the following code:

x[, VALUE_CUM := sum(x$VALUE * ifelse(x$ID == ID & x$DT_START <= DT_START & x$DT_END > DT_START, 1, 0)), by = .(ID, DT_START)]
x
    ID   DT_START     DT_END VALUE VALUE_CUM
 1:  1 2017-05-28 2018-05-28   300       300
 2:  1 2017-05-29 2018-05-29   400       700
 3:  1 2017-07-03 2018-07-03   200       900
 4:  1 2018-05-28 2018-05-29   100       700
 5:  1 2018-05-29 2019-05-28   150       350
 6:  2 2019-07-03 2019-10-08   250       250
 7:  2 2019-10-08 2020-07-03   350       350
 8:  2 2020-05-28 2021-05-28    50       400
 9:  2 2020-05-29 2021-05-29    10       410
10:  2 2020-07-03 2020-10-03    45       105
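
To make the rule concrete, here is a small sketch that checks row 4 by hand. It rebuilds only the ID 1 rows from the example; `x1`, `t0` and `active` are names introduced here for illustration.

```r
library(data.table)

# ID 1 rows from the example (x1 is a hypothetical name for this subset)
x1 <- data.table(ID = 1,
                 DT_START = as.Date(c('2017-05-28', '2017-05-29', '2017-07-03', '2018-05-28', '2018-05-29')),
                 DT_END = as.Date(c('2018-05-28', '2018-05-29', '2018-07-03', '2018-05-29', '2019-05-28')),
                 VALUE = c(300, 400, 200, 100, 150))

# Start date of row 4
t0 <- x1$DT_START[4]

# Intervals that have started on or before t0 and have not yet ended at t0
active <- x1[DT_START <= t0 & DT_END > t0]

active$VALUE       # values of the intervals still active at t0
sum(active$VALUE)  # the VALUE_CUM for row 4
```

The 300 from row 1 drops out because its DT_END equals 2018-05-28 and the comparison on DT_END is strict, which leaves 400 + 200 + 100 = 700.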

The one-liner above gives the right result, but it takes way too long on my huge data table with millions of rows. Any ideas for how to do this more elegantly so it runs faster?

Thanks!

Here is a possible way to do it, using a non-equi self-join that computes one sum per row:

y <- x[x, .(
    DT_END2 = i.DT_END,
    VALUE = i.VALUE, VALUE_CUM = sum(x.VALUE)),
    on = .(ID, DT_START <= DT_START, DT_END > DT_START), by = .EACHI]

# In the join result, DT_END holds the values of DT_START (a side effect of the
# non-equi join), so DT_END2 keeps a backup of the correct DT_END values.
y[, DT_END := DT_END2][, DT_END2 := NULL]

#     ID   DT_START     DT_END VALUE VALUE_CUM
#  1:  1 2017-05-28 2018-05-28   300       300
#  2:  1 2017-05-29 2018-05-29   400       700
#  3:  1 2017-07-03 2018-07-03   200       900
#  4:  1 2018-05-28 2018-05-29   100       700
#  5:  1 2018-05-29 2019-05-28   150       350
#  6:  2 2019-07-03 2019-10-08   250       250
#  7:  2 2019-10-08 2020-07-03   350       350
#  8:  2 2020-05-28 2021-05-28    50       400
#  9:  2 2020-05-29 2021-05-29    10       410
# 10:  2 2020-07-03 2020-10-03    45       105
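
If the goal is only to add VALUE_CUM to the existing table, a variant of the same join could return just the sums and assign them back by reference, which avoids the DT_END2 workaround. This is a sketch resting on two assumptions: that `by = .EACHI` yields one result row per row of `i`, in the original row order, and that every row has DT_END > DT_START so it matches at least itself (otherwise the sum for that row would come back as NA). `cum` is a name introduced here.

```r
library(data.table)

x <- data.table(ID = c(rep(1, 5), rep(2, 5)),
                DT_START = as.Date(c('2017-05-28', '2017-05-29', '2017-07-03', '2018-05-28', '2018-05-29',
                                     '2019-07-03', '2019-10-08', '2020-05-28', '2020-05-29', '2020-07-03')),
                DT_END = as.Date(c('2018-05-28', '2018-05-29', '2018-07-03', '2018-05-29', '2019-05-28',
                                   '2019-10-08', '2020-07-03', '2021-05-28', '2021-05-29', '2020-10-03')),
                VALUE = c(300, 400, 200, 100, 150, 250, 350, 50, 10, 45))

# Same non-equi self-join as above, but keep only the per-row sum
cum <- x[x, .(VALUE_CUM = sum(x.VALUE)),
         on = .(ID, DT_START <= DT_START, DT_END > DT_START),
         by = .EACHI]$VALUE_CUM

# Assign by reference; DT_START and DT_END in x are left untouched
x[, VALUE_CUM := cum]
x
```

It is worth comparing the result against the slow one-liner on a sample of the real data before relying on it.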
