
How do I create a moving average based on weekly dates, grouped by multiple columns in data.table?

I am reading in an extremely large dataset as a data.table for speed. The relevant columns are DATE (weekly data as year-month-day strings, e.g. "2017-12-25"), V1 (integer), V2 (string), and V3 (numeric). I would like to produce V4, the moving average of V3 over the last 3 weeks (DATE, DATE-7, and DATE-14). Here is a naive attempt, which is terribly inefficient:

library(data.table)

dt <- fread("largefile.csv")

dt[, DATE := as.IDate(DATE)]  # convert dates to date format

V1_list <- sort(unique(dt$V1))

V2_list <- sort(unique(dt$V2))

DATE_list <- sort(unique(dt$DATE))

# loop over every (V1, V2, DATE) combination; very slow on large data
for (i in seq_along(V1_list)) {
  for (j in seq_along(V2_list)) {
    for (k in 3:length(DATE_list)) {
      # vectorised `&` (not `&&`); parentheses needed in (k - 2):k
      dt[V1 == V1_list[i] & V2 == V2_list[j] & DATE == DATE_list[k],
         V4 := mean(dt[V1 == V1_list[i] & V2 == V2_list[j] & DATE %in% DATE_list[(k - 2):k], V3])]
    }
  }
}

I am avoiding using plyr partly due to computational constraints given the 50M rows I'm using. I have investigated options with setkey() and zoo / rolling functions but I am unable to figure out how to layer in the date component (assuming I group by V1 , V2 and average V3 ). Apologies for not providing sample code.

The OP has requested to append a new column which is the rolling average of V3 over the past 3 weeks grouped by V1 and V2 for a data.table of 50 M rows.

If the DATE values have no gaps, i.e., no weeks are missing in any group, one possible approach is to use the rollmeanr() function from the zoo package:

DT[order(DATE), V4 := zoo::rollmeanr(V3, 3L, fill = NA), by = .(V1, V2)]
DT[order(V1, V2, DATE)]
          DATE V1 V2 V3 V4
 1: 2017-12-04  1  A  1 NA
 2: 2017-12-11  1  A  2 NA
 3: 2017-12-18  1  A  3  2
 4: 2017-12-25  1  A  4  3
 5: 2017-12-04  1  B  5 NA
 6: 2017-12-11  1  B  6 NA
 7: 2017-12-18  1  B  7  6
 8: 2017-12-25  1  B  8  7
 9: 2017-12-04  2  A  9 NA
10: 2017-12-11  2  A 10 NA
11: 2017-12-18  2  A 11 10
12: 2017-12-25  2  A 12 11
13: 2017-12-04  2  B 13 NA
14: 2017-12-11  2  B 14 NA
15: 2017-12-18  2  B 15 14
16: 2017-12-25  2  B 16 15

Note that the NAs are introduced on purpose because there are no DATE-7 and DATE-14 values for the first two rows within each group.

Also note that this approach does not require type conversion of the character dates.
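Because the rolling window counts rows rather than calendar weeks, the no-gap assumption above is worth verifying before relying on the result. The following is a minimal sketch of such a check, not part of the original answer; the toy table and the column names n_weeks and has_gap are made up for illustration. It converts the dates only inside the check and flags any group whose consecutive dates are not exactly 7 days apart.

```r
library(data.table)

# Hypothetical toy data: group V1 = 2 is missing the week of 2017-12-11
DT <- data.table(
  DATE = c("2017-12-04", "2017-12-11", "2017-12-18", "2017-12-04", "2017-12-18"),
  V1   = c(1L, 1L, 1L, 2L, 2L),
  V2   = "A"
)

# For each group, look at the day differences between consecutive dates;
# any difference other than 7 means a week is missing (or duplicated)
gap_check <- DT[order(DATE),
                .(n_weeks = .N,
                  has_gap = any(diff(as.integer(as.IDate(DATE))) != 7L)),
                by = .(V1, V2)]
print(gap_check)
```

If any group is flagged, a row-based rolling mean would silently average over the wrong weeks for that group, so a join on the actual dates may be the safer choice.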

Data

According to the OP's description, the data.table has 4 columns: DATE holds weekly character dates in the standard unambiguous format %Y-%m-%d, V1 is of type integer, V2 is of type character, and V3 is of type double (numeric). V1 and V2 are used for grouping.

library(data.table)
# create data
n_week = 4L
n_V1 = 2L
# cross join
DT <- CJ(
  DATE = as.character(rev(seq(as.Date("2017-12-25"), length.out = n_week, by = "-1 week"))),
  V1 = seq_len(n_V1),
  V2 = LETTERS[1:2]
)
DT[order(V1, V2, DATE), V3 := as.numeric(seq_len(.N))][]
          DATE V1 V2 V3
 1: 2017-12-04  1  A  1
 2: 2017-12-04  1  B  5
 3: 2017-12-04  2  A  9
 4: 2017-12-04  2  B 13
 5: 2017-12-11  1  A  2
 6: 2017-12-11  1  B  6
 7: 2017-12-11  2  A 10
 8: 2017-12-11  2  B 14
 9: 2017-12-18  1  A  3
10: 2017-12-18  1  B  7
11: 2017-12-18  2  A 11
12: 2017-12-18  2  B 15
13: 2017-12-25  1  A  4
14: 2017-12-25  1  B  8
15: 2017-12-25  2  A 12
16: 2017-12-25  2  B 16
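As an aside that is not in the original answer: data.table versions 1.12.0 and later ship their own rolling mean, frollmean(), which avoids the zoo dependency. The sketch below rebuilds the sample data and assumes, as above, that no weeks are missing within a group. It relies on frollmean()'s defaults of a right-aligned window and NA fill.

```r
library(data.table)

# Rebuild the sample data: 4 weeks x 2 values of V1 x 2 values of V2
DT <- CJ(
  DATE = as.character(seq(as.Date("2017-12-04"), by = "week", length.out = 4L)),
  V1 = 1:2,
  V2 = c("A", "B")
)
DT[order(V1, V2, DATE), V3 := as.numeric(seq_len(.N))]

# Sort so that rows within each group are in chronological order;
# ISO-formatted character dates sort correctly as plain strings
setorder(DT, V1, V2, DATE)

# Right-aligned 3-row rolling mean per group; the first two rows of
# each group get NA because the window is incomplete
DT[, V4 := frollmean(V3, 3L), by = .(V1, V2)]
print(DT)
```

On this sample data the V4 column should match the rollmeanr() output, since both compute a right-aligned 3-row window within each group.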

So I tried to solve your problem using two inner_joins from the dplyr package:

First I created an example data.frame (1,000,000 rows):

V3 <- seq(from=1, to=1000000, by =1 )
DATE <- seq(from=1, to= 7000000, by =7)
dt <- data.frame(V3, DATE)

Does it look correct? I dropped all unnecessary columns and ignored the Date format (you can subtract Dates the same way as integers).

Next, I did two inner joins on the DATE column. In the right-hand data.frames DATE is shifted by +7 and +14, so each row is matched with the V3 values from one and two weeks earlier. Finally, I selected the 3 relevant columns and computed the row means. It took about 5 seconds on my lousy MacBook.

library(dplyr)

# V3.x = current week, V3.y = one week earlier, V3 = two weeks earlier
inner_join(
    inner_join(x = dt, y = mutate(dt, DATE = DATE + 7), by = 'DATE'),
    y = mutate(dt, DATE = DATE + 14), by = 'DATE') %>%
    select(V3, V3.y, V3.x) %>%
    rowMeans()

If you want to add it to your dt, remember that the first 2 dates have no average because DATE-14 and DATE-7 do not exist for them.

dt$V4 <-   c(NA, NA, inner_join(
        inner_join(x= dt, y=mutate(dt, DATE=DATE+7), by= 'DATE'),
        y = mutate(dt, DATE= DATE+14), by= 'DATE')  %>% 
        select(V3 , V3.y, V3.x) %>% 
        rowMeans())
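The example above ignores the V1 and V2 grouping from the question. As a hedged sketch only, the same self-join idea can be written in data.table with the groups included, using update joins on the shifted dates. The helper tables prev1 and prev2 and the columns V3_lag1 and V3_lag2 are names invented here for illustration. Because the join matches actual dates, a missing week yields NA instead of a wrong average; the sample data deliberately drops one week to show this.

```r
library(data.table)

# Sample data: 4 weeks x 2 x 2 groups, with DATE stored as IDate
DT <- CJ(
  DATE = as.IDate(seq(as.Date("2017-12-04"), by = "week", length.out = 4L)),
  V1 = 1:2,
  V2 = c("A", "B")
)
DT[order(V1, V2, DATE), V3 := as.numeric(seq_len(.N))]

# Simulate a gap: remove the week of 2017-12-11 from group (1, "A")
DT <- DT[!(V1 == 1L & V2 == "A" & DATE == as.IDate("2017-12-11"))]

# Shift the dates forward so that a row dated d carries the value
# that a row dated d + 7 (or d + 14) needs as its lag
prev1 <- DT[, .(V1, V2, DATE = DATE + 7L,  V3_lag1 = V3)]
prev2 <- DT[, .(V1, V2, DATE = DATE + 14L, V3_lag2 = V3)]

# Update joins: unmatched rows keep NA in the new columns
DT[prev1, on = .(V1, V2, DATE), V3_lag1 := i.V3_lag1]
DT[prev2, on = .(V1, V2, DATE), V3_lag2 := i.V3_lag2]

# The average is NA whenever one of the three weeks is missing
DT[, V4 := (V3 + V3_lag1 + V3_lag2) / 3]
print(DT[order(V1, V2, DATE)])
```

Whether this scales comfortably to 50M rows is untested here; it trades the row-based window for two keyed joins.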

