How to calculate moving average by specified grouping and deal with NAs

I have a data.table for which I need a moving average over the previous n days of data (let's use n = 2 for simplicity, not including the current day) within a specified grouping (ID1, ID2). The moving average should use the last 2 days of values for each ID1-ID2 pair. I would like to handle NAs in two separate ways:

1. Only calculate the average when there are 2 non-NA observations; otherwise the average should be NA (e.g. the first 2 days within an ID1-ID2 pair will always be NA).
2. Calculate the average from whatever non-NA observations exist within the last 2 days (na.rm = TRUE?).

I've tried the zoo package and various functions within it, and settled on the following. I used shift() to exclude the current day from the average, and put the dates in reverse order to highlight that the dates are not always ordered initially:

library(zoo)
library(data.table)
DATE = rev(rep(seq(as.Date("2018-01-01"),as.Date("2018-01-04"),"day"),4))
VALUE =seq(1,16,1)
VALUE[16] <- NA
ID1 = rep(c("A","B"),each=8)
ID2 = rep(1:2,2,each=4)
testdata = data.frame (DATE, ID1, ID2, VALUE)
setDT(testdata)[order(DATE),
  VALUE_AVG := shift(rollapplyr(VALUE, 2, mean, na.rm = TRUE, fill = NA)),
  by = c("ID1", "ID2")]

I seem to have trouble grouping by multiple columns. Groupings where VALUE begins/ends with NA values also seem to cause issues. I'm open to any solutions which make sense within a data.table framework, especially frollmean (need to update my versions of R + data.table). I don't know if I need to order the dates differently in conjunction with a specified alignment (eg "right").

I would hope my output would look something like the following except ordered by oldest date first per ID1-ID2 grouping:

           DATE ID1 ID2 VALUE VALUE_AVG
 1: 2018-01-04   A   1     1       2.5
 2: 2018-01-03   A   1     2       3.5
 3: 2018-01-02   A   1     3        NA
 4: 2018-01-01   A   1     4        NA
 5: 2018-01-04   A   2     5       6.5
 6: 2018-01-03   A   2     6       7.5
 7: 2018-01-02   A   2     7        NA
 8: 2018-01-01   A   2     8        NA
 9: 2018-01-04   B   1     9      10.5
10: 2018-01-03   B   1    10      11.5
11: 2018-01-02   B   1    11        NA
12: 2018-01-01   B   1    12        NA
13: 2018-01-04   B   2    13      14.5
14: 2018-01-03   B   2    14      15.0
15: 2018-01-02   B   2    15        NA
16: 2018-01-01   B   2    NA        NA

My code seems to roughly achieve the desired results for the sample data. However, when I run the same code on a large dataset for a 4-week average, where ID1 and ID2 are both integers, I get the following error:

Error in seq.default(start.at, NROW(data), by = by) : 
  wrong sign in 'by' argument

My results seem right for most ID1-ID2 combinations but there are specific cases of ID1 where VALUE has leading and trailing NAs. I'm guessing this is causing the issue, although it hasn't for the example above.

Using shift complicates this unnecessarily; rollapply can already handle the offset itself. In rollapplyr, specify:

  • a width of list(-seq(2)) to specify that it should act on offsets -1 and -2.

  • partial = TRUE to indicate that if there are fewer than 2 prior rows it will use whatever is there.

  • fill = NA to fill empty cells with NA

  • na.rm = TRUE to remove any NAs and only perform the mean on the remaining cells. If the prior cells are all NA then mean gives NaN.

To consider only windows with 2 prior non-NA values, giving NA otherwise, remove the partial = TRUE and na.rm = TRUE arguments.
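
To see what the list(-seq(2)) width does in isolation, here is a minimal sketch on a plain vector. The vector x is an illustrative assumption: it holds the values of the A/1 group from the question, sorted oldest first.

```r
library(zoo)

x <- c(4, 3, 2, 1)  # one group's values, oldest first (illustrative)

# Strict: mean of the 2 values before each position; NA when fewer than 2 exist
strict <- rollapplyr(x, list(-seq(2)), mean, fill = NA)

# Lenient: use whatever prior values exist and drop NAs before averaging
lenient <- rollapplyr(x, list(-seq(2)), mean, fill = NA,
                      partial = TRUE, na.rm = TRUE)

print(strict)   # NA NA 3.5 2.5
print(lenient)  # NA 4.0 3.5 2.5
```

The current element is never part of its own window, which is why no shift() is needed.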

First case

Take the mean of the non-NA values in the prior 2 rows, or in fewer rows if fewer prior rows exist (option 2 in the question).

testdata <- data.table(DATE, ID1, ID2, VALUE, key = c("ID1", "ID2", "DATE"))
testdata[, VALUE_AVG := 
  rollapplyr(VALUE, list(-seq(2)), mean, fill = NA, partial = TRUE, na.rm = TRUE),
  by = c("ID1", "ID2")]
testdata

giving:

          DATE ID1 ID2 VALUE VALUE_AVG
 1: 2018-01-01   A   1     4        NA
 2: 2018-01-02   A   1     3       4.0
 3: 2018-01-03   A   1     2       3.5
 4: 2018-01-04   A   1     1       2.5
 5: 2018-01-01   A   2     8        NA
 6: 2018-01-02   A   2     7       8.0
 7: 2018-01-03   A   2     6       7.5
 8: 2018-01-04   A   2     5       6.5
 9: 2018-01-01   B   1    12        NA
10: 2018-01-02   B   1    11      12.0
11: 2018-01-03   B   1    10      11.5
12: 2018-01-04   B   1     9      10.5
13: 2018-01-01   B   2    NA        NA
14: 2018-01-02   B   2    15       NaN
15: 2018-01-03   B   2    14      15.0
16: 2018-01-04   B   2    13      14.5

Second case

Give NA if either of the prior 2 rows is NA or if there are fewer than 2 prior rows (option 1 in the question).

testdata <- data.table(DATE, ID1, ID2, VALUE, key = c("ID1", "ID2", "DATE"))
testdata[, VALUE_AVG := 
  rollapplyr(VALUE, list(-seq(2)), mean, fill = NA),
  by = c("ID1", "ID2")]
testdata

giving:

          DATE ID1 ID2 VALUE VALUE_AVG
 1: 2018-01-01   A   1     4        NA
 2: 2018-01-02   A   1     3        NA
 3: 2018-01-03   A   1     2       3.5
 4: 2018-01-04   A   1     1       2.5
 5: 2018-01-01   A   2     8        NA
 6: 2018-01-02   A   2     7        NA
 7: 2018-01-03   A   2     6       7.5
 8: 2018-01-04   A   2     5       6.5
 9: 2018-01-01   B   1    12        NA
10: 2018-01-02   B   1    11        NA
11: 2018-01-03   B   1    10      11.5
12: 2018-01-04   B   1     9      10.5
13: 2018-01-01   B   2    NA        NA
14: 2018-01-02   B   2    15        NA
15: 2018-01-03   B   2    14        NA
16: 2018-01-04   B   2    13      14.5
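
Since the question also mentions frollmean, here is a rough sketch of the same two cases using only data.table. It assumes a data.table version that provides frollmean with the adaptive argument. The column names AVG_STRICT and AVG_LENIENT are made up for illustration. Here shift() is used to exclude the current row, and the adaptive window is one way to emulate partial = TRUE; this is not taken from the answers above.

```r
library(data.table)

# Rebuild the sample data from the question
DATE  <- rev(rep(seq(as.Date("2018-01-01"), as.Date("2018-01-04"), "day"), 4))
VALUE <- c(seq(1, 15, 1), NA)
ID1   <- rep(c("A", "B"), each = 8)
ID2   <- rep(1:2, 2, each = 4)
dt <- data.table(DATE, ID1, ID2, VALUE, key = c("ID1", "ID2", "DATE"))

# Strict: the 2-row trailing mean is NA if the window is incomplete or holds an NA;
# shift() then moves it down one row so the current row is excluded
dt[, AVG_STRICT := shift(frollmean(VALUE, 2L)), by = .(ID1, ID2)]

# Lenient: an adaptive window of 1 row at the start of a group and 2 rows
# afterwards, with NAs dropped, then shifted to exclude the current row
dt[, AVG_LENIENT := shift(frollmean(VALUE, pmin(seq_len(.N), 2L),
                                    adaptive = TRUE, na.rm = TRUE)),
   by = .(ID1, ID2)]

print(dt)
```

On the sample data this should reproduce the two VALUE_AVG columns shown above, with AVG_LENIENT matching the first case and AVG_STRICT the second.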

Maybe something like:

setorder(setDT(testdata), ID1, ID2, DATE)
testdata[order(DATE), VALUE_AVG := shift(
        # mean of the non-NA values in each 2-row window; NA when the window has none
        rollapplyr(VALUE, 2L, function(x) if (sum(!is.na(x)) > 0L) mean(x, na.rm = TRUE) else NA_real_, fill = NA_real_)
    ), by = c("ID1", "ID2")]
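
The "wrong sign in 'by' argument" error from the question is not addressed explicitly above. One plausible cause, stated here as an assumption rather than something confirmed in the post, is a group with fewer rows than the window width, which may lead rollapply to build an invalid index sequence. Below is a minimal sketch of a guard for the strict case. The helper prior_mean and the toy table are hypothetical.

```r
library(zoo)
library(data.table)

# Hypothetical helper (not from the original post): strict mean of the k values
# before each element; groups too short to ever fill a window get all NA
# without calling rollapply at all.
prior_mean <- function(x, k) {
  if (length(x) <= k) return(rep(NA_real_, length(x)))
  rollapplyr(x, list(-seq_len(k)), mean, fill = NA)
}

dt <- data.table(
  ID    = c("A", "A", "A", "A", "B"),  # group B has a single row
  VALUE = c(4, 3, 2, 1, 9)
)
dt[, VALUE_AVG := prior_mean(VALUE, 2L), by = ID]
print(dt)
```

Checking the group sizes first, for example with testdata[, .N, by = .(ID1, ID2)][N < 4], would show whether short groups exist in the real data.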
