简体   繁体   中英

Finding max of data.table column within range of rows from current observation by group

Ok so that title is quite a mouthful but here's the problem I solved and I was curious if anyone had a better solution or could generalize it further.

I have a time series as a data.table and I'm interested in finding out if an observation "bucks the trend" so to speak of the data before and after. Ie Is this observation larger than the year of observations before and after ?

To do this, my thought was to build in another column that grabs the max from the rows above or below and then just check if a row is equal to that max.

My data, luckily was regularly ordered, meaning that every row is the same distance of time from it's neighboring row. I use this fact to manually specify window size, rather than having to check if each row is within the time distance of interest.

#######################
# Package Loading
usePackage <- function(p) {
  if (!is.element(p, installed.packages()[,1]))
    install.packages(p, dep = TRUE)
  require(p, character.only = TRUE)
}

packages <- c("data.table","lubridate")
for(package in packages) usePackage(package)
rm(packages,usePackage)
#######################

set.seed(1337)

# creating a data.table
mydt <- data.table(Name = c(rep("Roger",12),rep("Johnny",8),"Mark"),
                   Date = c(seq(ymd('2010-06-15'),ymd('2015-12-15'), by = '6 month'),
                            seq(ymd('2012-06-15'),ymd('2015-12-15'), by = '6 month'),
                            ymd('2015-12-15')))

mydt[ , Value := c(rnorm(12,15,1),rnorm(8,30,2),rnorm(1,100,30))]
setkey(mydt, Name, Date)

# setting the number of rows up or down to check
windowSize <- 2

# applying the windowing max function
mydt[,
     windowMax := unlist(lapply(1:.N, function(x) max(.SD[Filter(function(y) y>0 & y <= .N, unique(abs(x+(-windowSize:windowSize)))), Value]))),
     by = Name]

# checking if a value is the local max (by window)
mydt[, isMaxValue := windowMax == Value]
mydt

As you can see, the windowing function is a mess but it does the trick. My question is: do you know a simpler, more succinct, or more readable way to do the same thing? Do you know how to generalize this to take irregular time series into account (ie not a fixed window)? I couldn't get the zoo::rollapply to do what I wanted but I don't have that much experience with it (I couldn't solve the problem of a group with 1 row causing the function to crash).

Let me know your thoughts and thank you!

This doesn't really address the time-window part, but if you want a one-liner with zoo::rollapply , you can do:

width <- 2 * windowSize + 1 # One central obs. and two on each side

mydt[, isMaxValue2 := rollapply(Value, width, max, partial = TRUE) == Value, by=Name]
identical(mydt$isMaxValue, mydt$isMaxValue2) # TRUE

It's somewhat more legible than your proposed solution, I think.

The partial = TRUE argument deals with the "border effects" when there are less than 5 observations in the window.

I think something like rollapply (@hfty's answer) makes more sense, but here's another way:

mydt[, wmax := do.call(pmax, c(
  shift(Value, 2:1, type = "lag"),
  shift(Value, 0:2, type = "lead"), 
  list(na.rm = TRUE)
)), by=Name]

which seems to work:

      Name                Date     Value windowMax      wmax
 1: Johnny 2012-06-14 20:00:00  30.31510  32.97827  32.97827
 2: Johnny 2012-12-14 19:00:00  32.97827  32.97827  32.97827
 3: Johnny 2013-06-14 20:00:00  29.84842  32.97827  32.97827
 4: Johnny 2013-12-14 19:00:00  32.54356  32.97827  32.97827
 5: Johnny 2014-06-14 20:00:00  31.28335  33.72532  33.72532
 6: Johnny 2014-12-14 19:00:00  31.60152  33.72532  33.72532
 7: Johnny 2015-06-14 20:00:00  33.72532  33.72532  33.72532
 8: Johnny 2015-12-14 19:00:00  28.90929  33.72532  33.72532
 9:   Mark 2015-12-14 19:00:00 118.57833 118.57833 118.57833
10:  Roger 2010-06-14 20:00:00  15.19249  15.19249  15.19249
11:  Roger 2010-12-14 19:00:00  13.55330  16.62230  16.62230
12:  Roger 2011-06-14 20:00:00  14.67682  16.62230  16.62230
13:  Roger 2011-12-14 19:00:00  16.62230  17.04212  17.04212
14:  Roger 2012-06-14 20:00:00  14.31098  17.04212  17.04212
15:  Roger 2012-12-14 19:00:00  17.04212  17.08193  17.08193
16:  Roger 2013-06-14 20:00:00  15.94378  17.08193  17.08193
17:  Roger 2013-12-14 19:00:00  17.08193  17.08193  17.08193
18:  Roger 2014-06-14 20:00:00  16.91712  17.08193  17.08193
19:  Roger 2014-12-14 19:00:00  14.58519  17.08193  17.08193
20:  Roger 2015-06-14 20:00:00  16.03285  16.91712  16.91712
21:  Roger 2015-12-14 19:00:00  13.32143  16.03285  16.03285
      Name                Date     Value windowMax      wmax

To see how it works, one could look at the vectors before the pmax is taken:

mydt[, c(
  shift(Value, 2:1, type = "lag"),
  shift(Value, 0:2, type = "lead")
), by=Name]


 #      Name       V1       V2        V3       V4       V5
 # 1: Johnny       NA       NA  30.31510 32.97827 29.84842
 # 2: Johnny       NA 30.31510  32.97827 29.84842 32.54356
 # 3: Johnny 30.31510 32.97827  29.84842 32.54356 31.28335
 # 4: Johnny 32.97827 29.84842  32.54356 31.28335 31.60152
 # 5: Johnny 29.84842 32.54356  31.28335 31.60152 33.72532
 # 6: Johnny 32.54356 31.28335  31.60152 33.72532 28.90929
 # 7: Johnny 31.28335 31.60152  33.72532 28.90929       NA
 # 8: Johnny 31.60152 33.72532  28.90929       NA       NA
 # 9:   Mark       NA       NA 118.57833       NA       NA
# 10:  Roger       NA       NA  15.19249 13.55330 14.67682
# 11:  Roger       NA 15.19249  13.55330 14.67682 16.62230
# 12:  Roger 15.19249 13.55330  14.67682 16.62230 14.31098
# 13:  Roger 13.55330 14.67682  16.62230 14.31098 17.04212
# 14:  Roger 14.67682 16.62230  14.31098 17.04212 15.94378
# 15:  Roger 16.62230 14.31098  17.04212 15.94378 17.08193
# 16:  Roger 14.31098 17.04212  15.94378 17.08193 16.91712
# 17:  Roger 17.04212 15.94378  17.08193 16.91712 14.58519
# 18:  Roger 15.94378 17.08193  16.91712 14.58519 16.03285
# 19:  Roger 17.08193 16.91712  14.58519 16.03285 13.32143
# 20:  Roger 16.91712 14.58519  16.03285 13.32143       NA
# 21:  Roger 14.58519 16.03285  13.32143       NA       NA
#       Name       V1       V2        V3       V4       V5

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM