
R data.table cumulative sum over time intervals

I have a table of values that are active during specific time intervals. For each row, I want a field that sums all values for that ID whose intervals are active at the start time of that row's interval.

Here is a pared-down example:

library(data.table)

x = data.table(ID = c(rep(1, 5), rep(2, 5)),
               DT_START = as.Date(c('2017-05-28', '2017-05-29', '2017-07-03', '2018-05-28', '2018-05-29',
                                    '2019-07-03', '2019-10-08', '2020-05-28', '2020-05-29', '2020-07-03')),
               DT_END = as.Date(c('2018-05-28', '2018-05-29', '2018-07-03', '2018-05-29', '2019-05-28',
                                  '2019-10-08', '2020-07-03', '2021-05-28', '2021-05-29', '2020-10-03')),
               VALUE = c(300, 400, 200, 100, 150, 250, 350, 50, 10, 45))

The table looks like this:

x
    ID   DT_START     DT_END VALUE
 1:  1 2017-05-28 2018-05-28   300
 2:  1 2017-05-29 2018-05-29   400
 3:  1 2017-07-03 2018-07-03   200
 4:  1 2018-05-28 2018-05-29   100
 5:  1 2018-05-29 2019-05-28   150
 6:  2 2019-07-03 2019-10-08   250
 7:  2 2019-10-08 2020-07-03   350
 8:  2 2020-05-28 2021-05-28    50
 9:  2 2020-05-29 2021-05-29    10
10:  2 2020-07-03 2020-10-03    45

In the first row, that is the earliest start date for that ID and no other interval is active yet, so the cumulative value is just 300. By the second row we add 300 + 400 to get 700, because as of 2017-05-29 both the 300 and the 400 were active for ID 1. The full desired output vector is obtained using the following code:

x[, VALUE_CUM := sum(x$VALUE * ifelse(x$ID == ID & x$DT_START <= DT_START & x$DT_END > DT_START, 1, 0)), by = .(ID, DT_START)]
x
    ID   DT_START     DT_END VALUE VALUE_CUM
 1:  1 2017-05-28 2018-05-28   300       300
 2:  1 2017-05-29 2018-05-29   400       700
 3:  1 2017-07-03 2018-07-03   200       900
 4:  1 2018-05-28 2018-05-29   100       700
 5:  1 2018-05-29 2019-05-28   150       350
 6:  2 2019-07-03 2019-10-08   250       250
 7:  2 2019-10-08 2020-07-03   350       350
 8:  2 2020-05-28 2021-05-28    50       400
 9:  2 2020-05-29 2021-05-29    10       410
10:  2 2020-07-03 2020-10-03    45       105

This works, but it takes way too long on my actual data table, which has millions of rows. Any ideas on how to do this more elegantly so it runs faster?

Thanks!

Here is a possible way to do it:

# non-equi self-join: for each row of x (as i), sum VALUE over all rows of x
# with the same ID whose interval contains that row's DT_START
y <- x[x, .(
    DT_END2 = i.DT_END,
    VALUE = i.VALUE, VALUE_CUM = sum(x.VALUE)),
    on = .(ID, DT_START <= DT_START, DT_END > DT_START), by = .EACHI]

# In a non-equi join the result's join columns hold the values from i (here DT_START),
# so DT_END is overwritten; DT_END2 backs up the correct DT_END values.
y[, DT_END := DT_END2][, DT_END2 := NULL]

#     ID   DT_START     DT_END VALUE VALUE_CUM
#  1:  1 2017-05-28 2018-05-28   300       300
#  2:  1 2017-05-29 2018-05-29   400       700
#  3:  1 2017-07-03 2018-07-03   200       900
#  4:  1 2018-05-28 2018-05-29   100       700
#  5:  1 2018-05-29 2019-05-28   150       350
#  6:  2 2019-07-03 2019-10-08   250       250
#  7:  2 2019-10-08 2020-07-03   350       350
#  8:  2 2020-05-28 2021-05-28    50       400
#  9:  2 2020-05-29 2021-05-29    10       410
# 10:  2 2020-07-03 2020-10-03    45       105
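
As a variation on the same idea, a sketch of assigning VALUE_CUM back to x by reference in one step, which avoids the intermediate table y and the DT_END2 workaround. It assumes every row matches at least itself (i.e. DT_END > DT_START holds everywhere), so no NA groups appear:

# by = .EACHI returns one aggregate row per row of i (here x itself), in i's
# original row order, so the unnamed result column V1 lines up with x's rows
x[, VALUE_CUM := x[x,
                   sum(x.VALUE),
                   on = .(ID, DT_START <= DT_START, DT_END > DT_START),
                   by = .EACHI]$V1]

This should reproduce the VALUE_CUM column shown above while keeping the original DT_END column untouched.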
