简体   繁体   中英

R data.table sum of group subset using dates

I have a dataset like the following:

library(data.table)    
dt1 <- data.table(urn = c(rep("a", 5), rep("b", 4)),
                  amount = c(10, 12, 23, 15, 19, 42, 11, 5, 10),
                  date = as.Date(c("2016-01-01", "2017-01-02", "2017-02-04",
                                   "2017-04-19", "2018-02-11", "2016-02-14",
                                   "2017-05-06", "2017-05-12", "2017-12-12")))
dt1
#    urn amount       date
# 1:   a     10 2016-01-01
# 2:   a     12 2017-01-02
# 3:   a     23 2017-02-04
# 4:   a     15 2017-04-19
# 5:   a     19 2018-02-11
# 6:   b     42 2016-02-14
# 7:   b     11 2017-05-06
# 8:   b      5 2017-05-12
# 9:   b     10 2017-12-12

I am trying to determine the cumulative value for a group over the preceding 12 months. I know I can use shift with data.table to scan backwards or forwards, the biggest challenge I can't get my head around is how to know how many records to sum when the number can change based on how many records each urn has.

The type of results I am looking for are:

dt1
#    urn amount       date summed12m
# 1:   a     10 2016-01-01        10
# 2:   a     12 2017-01-02        12
# 3:   a     23 2017-02-04        35
# 4:   a     15 2017-04-19        50
# 5:   a     19 2018-02-11        34
# 6:   b     42 2016-02-14        42
# 7:   b     11 2017-05-06        11
# 8:   b      5 2017-05-12        16
# 9:   b     10 2017-12-12        26   

I'm preferably looking for a data.table solution due to the volume of my data, but am open to other options too if it is likely to be efficient over a table with about 12M records.

As alternative to foverlaps() , this also can be solved by aggregating in a non-equi join :

library(lubridate)
dt1[, summed12m := dt1[.(urn, date, date %m-% months(12)), 
                       on = .(urn = V1, date <= V2, date >= V3), 
                       sum(amount), by = .EACHI]$V1][]
  urn amount date summed12m 1: a 10 2016-01-01 10 2: a 12 2017-01-02 12 3: a 23 2017-02-04 35 4: a 15 2017-04-19 50 5: a 19 2018-02-11 34 6: b 42 2016-02-14 42 7: b 11 2017-05-06 11 8: b 5 2017-05-12 16 9: b 10 2017-12-12 26 

lubridate is used for date arithmetic to avoid mishaps in case one of the dates is February, 29.

The essential part is the non-equi join

dt1[.(urn, date, date %m-% months(12)), 
    on = .(urn = V1, date <= V2, date >= V3), 
    sum(amount), by = .EACHI]
  urn date date V1 1: a 2016-01-01 2015-01-01 10 2: a 2017-01-02 2016-01-02 12 3: a 2017-02-04 2016-02-04 35 4: a 2017-04-19 2016-04-19 50 5: a 2018-02-11 2017-02-11 34 6: b 2016-02-14 2015-02-14 42 7: b 2017-05-06 2016-05-06 11 8: b 2017-05-12 2016-05-12 16 9: b 2017-12-12 2016-12-12 26 

of which the last column is picked to create the new summed12m column in dt1 .

Additional explanation

The OP has asked where V1 , V2 , and V3 come from.

The expression .(urn, date, date %m-% months(12)) creates a new data.table on the fly. ( .() is an data.table abbreviation for list() ). As no column names have been specified, data.table creates default column names V1 , V2 , etc.

Less sloppily, the expression can be re-written with explicitely named columns

dt1[.(urn = urn, end = date, start = date %m-% months(12)), 
    on = .(urn, date <= end, date >= start), 
    sum(amount), by = .EACHI]

This is shouting for foverlaps . My first time using foverlaps , hence I am quite sure a few experts here can better wield the function. Here it goes:

dt1[, date2 := date]
rng <- dt1[, .(urn, enddate=date, 
    startdate=as.Date(paste(year(date)-1, month(date), mday(date), sep="-")))]
setkey(rng, urn, startdate, enddate)
foverlaps(dt1, rng, by.x=c("urn","date","date2"), type="within")[,
    sum(amount), by=.(urn, enddate)]

#    urn    enddate V1
# 1:   a 2016-01-01 10
# 2:   a 2017-01-02 12
# 3:   a 2017-02-04 35
# 4:   a 2017-04-19 50
# 5:   a 2018-02-11 34
# 6:   b 2016-02-14 42
# 7:   b 2017-05-06 11
# 8:   b 2017-05-12 16
# 9:   b 2017-12-12 26

Further reading:

  1. How to perform join over date ranges using data.table?
  2. roll join with start/end window

Hope this helps!

dt1[, summed12m := {
        date_diff <- date - dt1$date
        sum(dt1$amount[date_diff >= 0 & date_diff <= 365 & urn==dt1$urn])
      }, 
      by = list(date, urn)]

Output is:

   urn amount       date summed12m
1:   a     10 2016-01-01        10
2:   a     12 2017-01-02        12
3:   a     23 2017-02-04        35
4:   a     15 2017-04-19        50
5:   a     19 2018-02-11        34
6:   b     42 2016-02-14        42
7:   b     11 2017-05-06        11
8:   b      5 2017-05-12        16
9:   b     10 2017-12-12        26

Sample data:

dt1 <- structure(list(urn = c("a", "a", "a", "a", "a", "b", "b", "b", 
"b"), amount = c(10, 12, 23, 15, 19, 42, 11, 5, 10), date = structure(c(16801, 
17168, 17201, 17275, 17573, 16845, 17292, 17298, 17512), class = "Date")), .Names = c("urn", 
"amount", "date"), row.names = c(NA, -9L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x00000000027b0788>)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM