I have a table with values that are active during specific time intervals. I want a field that, for each row, sums the values of all rows with the same ID whose interval is active at that row's start time.
Here is a pared-down example:
library(data.table)
x = data.table(ID = c(rep(1, 5), rep(2, 5)),
               DT_START = as.Date(c('2017-05-28', '2017-05-29', '2017-07-03', '2018-05-28', '2018-05-29',
                                    '2019-07-03', '2019-10-08', '2020-05-28', '2020-05-29', '2020-07-03')),
               DT_END = as.Date(c('2018-05-28', '2018-05-29', '2018-07-03', '2018-05-29', '2019-05-28',
                                  '2019-10-08', '2020-07-03', '2021-05-28', '2021-05-29', '2020-10-03')),
               VALUE = c(300, 400, 200, 100, 150, 250, 350, 50, 10, 45))
The table looks like this:
x
ID DT_START DT_END VALUE
1: 1 2017-05-28 2018-05-28 300
2: 1 2017-05-29 2018-05-29 400
3: 1 2017-07-03 2018-07-03 200
4: 1 2018-05-28 2018-05-29 100
5: 1 2018-05-29 2019-05-28 150
6: 2 2019-07-03 2019-10-08 250
7: 2 2019-10-08 2020-07-03 350
8: 2 2020-05-28 2021-05-28 50
9: 2 2020-05-29 2021-05-29 10
10: 2 2020-07-03 2020-10-03 45
The first row has the earliest start date for ID 1 and no other interval is active on that date, so the cumulative value is just 300. For the second row we add 300 + 400 to get 700, because as of 2017-05-29 both the 300 and the 400 were active for ID 1. An interval counts as active from its start date up to, but not including, its end date. The full desired output vector is obtained using the following code:
x[, VALUE_CUM := sum(x$VALUE * ifelse(x$ID == ID & x$DT_START <= DT_START & x$DT_END > DT_START, 1, 0)), by = .(ID, DT_START)]
x
ID DT_START DT_END VALUE VALUE_CUM
1: 1 2017-05-28 2018-05-28 300 300
2: 1 2017-05-29 2018-05-29 400 700
3: 1 2017-07-03 2018-07-03 200 900
4: 1 2018-05-28 2018-05-29 100 700
5: 1 2018-05-29 2019-05-28 150 350
6: 2 2019-07-03 2019-10-08 250 250
7: 2 2019-10-08 2020-07-03 350 350
8: 2 2020-05-28 2021-05-28 50 400
9: 2 2020-05-29 2021-05-29 10 410
10: 2 2020-07-03 2020-10-03 45 105
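To make the rule concrete, here is a minimal check of row 4 (ID 1, starting 2018-05-28) using only the ID 1 rows of the sample table. The names `x1`, `s` and `active` are introduced just for this illustration.

```r
library(data.table)

# The five ID 1 rows from the sample table
x1 <- data.table(
  DT_START = as.Date(c("2017-05-28", "2017-05-29", "2017-07-03", "2018-05-28", "2018-05-29")),
  DT_END   = as.Date(c("2018-05-28", "2018-05-29", "2018-07-03", "2018-05-29", "2019-05-28")),
  VALUE    = c(300, 400, 200, 100, 150)
)

s <- as.Date("2018-05-28")                 # start date of row 4
active <- x1[DT_START <= s & DT_END > s]   # intervals covering s; the end date is exclusive
active$VALUE                               # 400 200 100
sum(active$VALUE)                          # 700
```

The 300 from row 1 drops out because its end date equals `s`, which is why row 4 shows 700 rather than 1000.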
This grouped calculation gives the right result, but it takes way too long on my huge data table with millions of rows. Any ideas for how to do this more elegantly so that it runs faster?
Thanks!
Here is a possible way to do it, using a non-equi self-join:
y <- x[x, .(
DT_END2 = i.DT_END,
VALUE = i.VALUE, VALUE_CUM = sum(x.VALUE)),
on = .(ID, DT_START <= DT_START, DT_END > DT_START), by = .EACHI]
# In the join result, DT_END is overwritten by values of DT_START, so we use DT_END2 to back up the correct DT_END values.
y[, DT_END := DT_END2][, DT_END2 := NULL]
# ID DT_START DT_END VALUE VALUE_CUM
# 1: 1 2017-05-28 2018-05-28 300 300
# 2: 1 2017-05-29 2018-05-29 400 700
# 3: 1 2017-07-03 2018-07-03 200 900
# 4: 1 2018-05-28 2018-05-29 100 700
# 5: 1 2018-05-29 2019-05-28 150 350
# 6: 2 2019-07-03 2019-10-08 250 250
# 7: 2 2019-10-08 2020-07-03 350 350
# 8: 2 2020-05-28 2021-05-28 50 400
# 9: 2 2020-05-29 2021-05-29 10 410
# 10: 2 2020-07-03 2020-10-03 45 105
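If the goal is only to add the new column to the existing table, the same join can be used without the backup column. The following is a minimal sketch, not a benchmarked solution: the intermediate name `cum` is introduced here for illustration, and it assumes every interval has DT_END strictly later than DT_START, so that each row matches at least itself in the join.

```r
library(data.table)

x <- data.table(
  ID = c(rep(1, 5), rep(2, 5)),
  DT_START = as.Date(c("2017-05-28", "2017-05-29", "2017-07-03", "2018-05-28", "2018-05-29",
                       "2019-07-03", "2019-10-08", "2020-05-28", "2020-05-29", "2020-07-03")),
  DT_END = as.Date(c("2018-05-28", "2018-05-29", "2018-07-03", "2018-05-29", "2019-05-28",
                     "2019-10-08", "2020-07-03", "2021-05-28", "2021-05-29", "2020-10-03")),
  VALUE = c(300, 400, 200, 100, 150, 250, 350, 50, 10, 45)
)

# Non-equi self-join: by = .EACHI returns one result row per row of the
# inner x (the i table), in that table's row order
cum <- x[x, on = .(ID, DT_START <= DT_START, DT_END > DT_START),
         .(VALUE_CUM = sum(x.VALUE)), by = .EACHI]

# Attach the sums to x by reference; the rows line up because cum follows x's order
x[, VALUE_CUM := cum$VALUE_CUM]
x$VALUE_CUM
# 300 700 900 700 350 250 350 400 410 105
```

Only the `VALUE_CUM` column of the join result is used, so the overwritten DT_END column in `cum` does not matter here.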