
R data.table cumulative sum over time intervals

I have a table of values that are active during specific time intervals. For each row, I want a field that sums all values for that ID whose intervals are active at the start time of that row's interval.

Here is a pared-down example:

library(data.table)

x = data.table(ID = c(rep(1, 5), rep(2, 5)),
               DT_START = as.Date(c('2017-05-28', '2017-05-29', '2017-07-03', '2018-05-28', '2018-05-29',
                                    '2019-07-03', '2019-10-08', '2020-05-28', '2020-05-29', '2020-07-03')),
               DT_END = as.Date(c('2018-05-28', '2018-05-29', '2018-07-03', '2018-05-29', '2019-05-28',
                                  '2019-10-08', '2020-07-03', '2021-05-28', '2021-05-29', '2020-10-03')),
               VALUE = c(300, 400, 200, 100, 150, 250, 350, 50, 10, 45))

The table looks like this:

x
    ID   DT_START     DT_END VALUE
 1:  1 2017-05-28 2018-05-28   300
 2:  1 2017-05-29 2018-05-29   400
 3:  1 2017-07-03 2018-07-03   200
 4:  1 2018-05-28 2018-05-29   100
 5:  1 2018-05-29 2019-05-28   150
 6:  2 2019-07-03 2019-10-08   250
 7:  2 2019-10-08 2020-07-03   350
 8:  2 2020-05-28 2021-05-28    50
 9:  2 2020-05-29 2021-05-29    10
10:  2 2020-07-03 2020-10-03    45

In the first row, that is the earliest start date for that ID and no other interval is active yet, so the cumulative value is just 300. By the second row we add 300 + 400 to get 700, because as of 2017-05-29 both the 300 and the 400 were active for ID 1. The full desired output vector is obtained using the following code:

x[, VALUE_CUM := sum(x$VALUE * ifelse(x$ID == ID & x$DT_START <= DT_START & x$DT_END > DT_START, 1, 0)), by = .(ID, DT_START)]
x
    ID   DT_START     DT_END VALUE VALUE_CUM
 1:  1 2017-05-28 2018-05-28   300       300
 2:  1 2017-05-29 2018-05-29   400       700
 3:  1 2017-07-03 2018-07-03   200       900
 4:  1 2018-05-28 2018-05-29   100       700
 5:  1 2018-05-29 2019-05-28   150       350
 6:  2 2019-07-03 2019-10-08   250       250
 7:  2 2019-10-08 2020-07-03   350       350
 8:  2 2020-05-28 2021-05-28    50       400
 9:  2 2020-05-29 2021-05-29    10       410
10:  2 2020-07-03 2020-10-03    45       105

This works, but it takes way too long on my actual data table, which has millions of rows. Any ideas on how to do this more elegantly so it runs faster?

Thanks!

Here is a possible way to do it:

# non-equi self-join: for each row of x (as i), sum VALUE over all rows of x
# with the same ID whose interval contains that row's DT_START
y <- x[x, .(
    DT_END2 = i.DT_END,
    VALUE = i.VALUE, VALUE_CUM = sum(x.VALUE)),
    on = .(ID, DT_START <= DT_START, DT_END > DT_START), by = .EACHI]

# In a non-equi join the result's join columns hold the values from i (here DT_START),
# so DT_END is overwritten; DT_END2 backs up the correct DT_END values.
y[, DT_END := DT_END2][, DT_END2 := NULL]

#     ID   DT_START     DT_END VALUE VALUE_CUM
#  1:  1 2017-05-28 2018-05-28   300       300
#  2:  1 2017-05-29 2018-05-29   400       700
#  3:  1 2017-07-03 2018-07-03   200       900
#  4:  1 2018-05-28 2018-05-29   100       700
#  5:  1 2018-05-29 2019-05-28   150       350
#  6:  2 2019-07-03 2019-10-08   250       250
#  7:  2 2019-10-08 2020-07-03   350       350
#  8:  2 2020-05-28 2021-05-28    50       400
#  9:  2 2020-05-29 2021-05-29    10       410
# 10:  2 2020-07-03 2020-10-03    45       105
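
As a variation on the same idea, a sketch of assigning VALUE_CUM back to x by reference in one step, which avoids the intermediate table y and the DT_END2 workaround. It assumes every row matches at least itself (i.e. DT_END > DT_START holds everywhere), so no NA groups appear:

# by = .EACHI returns one aggregate row per row of i (here x itself), in i's
# original row order, so the unnamed result column V1 lines up with x's rows
x[, VALUE_CUM := x[x,
                   sum(x.VALUE),
                   on = .(ID, DT_START <= DT_START, DT_END > DT_START),
                   by = .EACHI]$V1]

This should reproduce the VALUE_CUM column shown above while keeping the original DT_END column untouched.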
