I am reading in an extremely large dataset as a data.table for speed. The relevant columns are DATE (weekly data in year-month-day strings, e.g. "2017-12-25"), V1 (integer), V2 (string), and V3 (numeric). I would like to produce V4, which is the moving average of V3 over the last 3 weeks (DATE, DATE-7, and DATE-14). Here is a naive attempt, which is terribly inefficient:
library(data.table)
dt <- fread("largefile.csv")
dt[, DATE := as.IDate(DATE)]  # convert date strings to IDate
V1_list <- sort(unique(dt$V1))
V2_list <- sort(unique(dt$V2))
DATE_list <- sort(unique(dt$DATE))
for (i in seq_along(V1_list)) {
  for (j in seq_along(V2_list)) {
    for (k in 3:length(DATE_list)) {
      # rows of the current (V1, V2) group; note the vectorised `&`, not `&&`
      grp <- dt$V1 == V1_list[i] & dt$V2 == V2_list[j]
      # rows in the 3-week window; `(k - 2):k` needs the parentheses
      window <- which(grp & dt$DATE %in% DATE_list[(k - 2):k])
      target <- which(grp & dt$DATE == DATE_list[k])
      dt[target, V4 := mean(dt$V3[window])]
    }
  }
}
I am avoiding plyr, partly due to computational constraints given the 50M rows I'm using. I have investigated options with setkey() and zoo / rolling functions, but I am unable to figure out how to layer in the date component (assuming I group by V1, V2 and average V3). Apologies for not providing sample code.
The OP has requested to append a new column which is the rolling average of V3 over the past 3 weeks, grouped by V1 and V2, for a data.table of 50 M rows.
If the DATE values are without gaps, i.e., without missing weeks in any group, one possible approach is to use the rollmeanr() function from the zoo package:
DT[order(DATE), V4 := zoo::rollmeanr(V3, 3L, fill = NA), by = .(V1, V2)]
DT[order(V1, V2, DATE)]
          DATE V1 V2 V3 V4
 1: 2017-12-04  1  A  1 NA
 2: 2017-12-11  1  A  2 NA
 3: 2017-12-18  1  A  3  2
 4: 2017-12-25  1  A  4  3
 5: 2017-12-04  1  B  5 NA
 6: 2017-12-11  1  B  6 NA
 7: 2017-12-18  1  B  7  6
 8: 2017-12-25  1  B  8  7
 9: 2017-12-04  2  A  9 NA
10: 2017-12-11  2  A 10 NA
11: 2017-12-18  2  A 11 10
12: 2017-12-25  2  A 12 11
13: 2017-12-04  2  B 13 NA
14: 2017-12-11  2  B 14 NA
15: 2017-12-18  2  B 15 14
16: 2017-12-25  2  B 16 15
Note that the NAs are purposefully introduced because we do not have DATE-7 and DATE-14 values for the first two rows within each group.
Also note that this approach does not require type conversion of the character dates.
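Since the approach above depends on the "no gaps" assumption, it may be worth checking that assumption before relying on the result. The following is only a sketch on toy data shaped like the example in this answer: `gap_check` is a hypothetical helper name, and `frollmean()` (data.table's own rolling mean, right-aligned with `fill = NA` by default) is swapped in here as an alternative to `zoo::rollmeanr()`, not something the original answer used.

```r
library(data.table)

# Toy data in the same shape as the example: 2 x 2 groups, 4 consecutive weeks
DT <- CJ(
  DATE = as.character(seq(as.Date("2017-12-04"), by = "1 week", length.out = 4L)),
  V1 = 1:2,
  V2 = c("A", "B")
)
DT[order(V1, V2, DATE), V3 := as.numeric(seq_len(.N))]

# Check the "no gaps" assumption: within each group, consecutive dates
# must be exactly 7 days apart (gap_check is a hypothetical name)
gap_check <- DT[order(DATE),
                .(no_gaps = all(diff(as.integer(as.IDate(DATE))) == 7L)),
                by = .(V1, V2)]

# Right-aligned rolling mean over 3 rows per group, using frollmean()
# instead of zoo::rollmeanr(); the first two rows of each group become NA
DT[order(DATE), V4 := frollmean(V3, 3L), by = .(V1, V2)]
```

If any row of `gap_check` has `no_gaps == FALSE`, a row-based window of 3 no longer corresponds to 3 calendar weeks for that group, and a date-aware approach would be needed instead.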
According to OP's description, the data.table has 4 columns: DATE holds weekly character dates in the standard unambiguous format %Y-%m-%d, V1 is of type integer, V2 is of type character, and V3 is of type double (numeric). V1 and V2 are used for grouping.
library(data.table)
# create data
n_week = 4L
n_V1 = 2L
# cross join
DT <- CJ(
  DATE = as.character(rev(seq(as.Date("2017-12-25"), length.out = n_week, by = "-1 week"))),
  V1 = seq_len(n_V1),
  V2 = LETTERS[1:2]
)
DT[order(V1, V2, DATE), V3 := as.numeric(seq_len(.N))][]
          DATE V1 V2 V3
 1: 2017-12-04  1  A  1
 2: 2017-12-04  1  B  5
 3: 2017-12-04  2  A  9
 4: 2017-12-04  2  B 13
 5: 2017-12-11  1  A  2
 6: 2017-12-11  1  B  6
 7: 2017-12-11  2  A 10
 8: 2017-12-11  2  B 14
 9: 2017-12-18  1  A  3
10: 2017-12-18  1  B  7
11: 2017-12-18  2  A 11
12: 2017-12-18  2  B 15
13: 2017-12-25  1  A  4
14: 2017-12-25  1  B  8
15: 2017-12-25  2  A 12
16: 2017-12-25  2  B 16
I tried to solve your problem using two inner_join() calls from the dplyr package.
First I created an example data.frame (1,000,000 rows):
V3 <- seq(from=1, to=1000000, by =1 )
DATE <- seq(from=1, to= 7000000, by =7)
dt <- data.frame(V3, DATE)
Does it look correct? I dropped all unnecessary columns and ignored the Date format (you can subtract from Dates the same way as from integers).
Next, I did two inner joins on the DATE column. In each join the second data.frame has its DATE shifted by +7 and +14 respectively, so that each row is matched with the values from one and two weeks earlier. Finally, I selected the 3 columns of interest and computed the row means. It took about 5 seconds on my lousy MacBook.
library(dplyr)
inner_join(
  inner_join(x = dt, y = mutate(dt, DATE = DATE + 7), by = 'DATE'),
  y = mutate(dt, DATE = DATE + 14), by = 'DATE') %>%
  select(V3, V3.y, V3.x) %>%
  rowMeans()
If you want to add it to your dt, remember that there is no average for the first 2 dates, because DATE-7 and DATE-14 do not exist for them.
dt$V4 <- c(NA, NA, inner_join(
  inner_join(x = dt, y = mutate(dt, DATE = DATE + 7), by = 'DATE'),
  y = mutate(dt, DATE = DATE + 14), by = 'DATE') %>%
  select(V3, V3.y, V3.x) %>%
  rowMeans())
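The join idea above ignores the V1 / V2 grouping from the question. As a rough sketch only, the same lookup could be written as data.table joins that include the grouping columns. The toy data, the helper tables `lag1` and `lag2`, and the columns `V3_lag1` and `V3_lag2` are all hypothetical names introduced for illustration, and the sketch assumes that each (V1, V2, DATE) combination occurs at most once. Because the lookup is by date rather than by row position, a missing week produces NA instead of silently averaging the wrong rows.

```r
library(data.table)

# Hypothetical toy data: group V1 = 2 is missing the week of 2017-12-11
DT <- data.table(
  DATE = as.IDate(c("2017-12-04", "2017-12-11", "2017-12-18", "2017-12-25",
                    "2017-12-04", "2017-12-18", "2017-12-25")),
  V1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  V2 = "A",
  V3 = c(1, 2, 3, 4, 10, 30, 40)
)

# Lookup tables: for every row, the dates one and two weeks earlier
lag1 <- DT[, .(V1, V2, DATE = DATE - 7L)]
lag2 <- DT[, .(V1, V2, DATE = DATE - 14L)]

# Join back onto DT within the same (V1, V2) group; rows without a match
# return NA because nomatch = NA is the default
DT[, V3_lag1 := DT[lag1, on = .(V1, V2, DATE), x.V3]]
DT[, V3_lag2 := DT[lag2, on = .(V1, V2, DATE), x.V3]]

# Average of the current week and the two preceding weeks;
# NA whenever one of the three weeks is unavailable
DT[, V4 := (V3 + V3_lag1 + V3_lag2) / 3]
```

Whether an incomplete window should give NA, as here, or the mean of the available weeks (for example via `rowMeans(..., na.rm = TRUE)`) is a design choice that the question does not settle.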