简体   繁体   中英

dplyr lag, with number of lags varying by observation

I have data of the following form. (also see below)

The first variable V1 describes the type of observation- those labelled H give information for all values below until another H observation. The second variable, for all other observations, says how many observations above is the H that describes the variable.

I want to compute a variable that is the lagged value of the third variable V3, where the number of lags is equal to the value of variable V2. Is there a way to use the lag() function in dplyr where I can feed it a number n of lags that changes with the row of my data set? Any other way to compute a lag that changes with the row? Ideally I would like to use methods from dplyr since I am applying this to large data for which sparklyr is useful.

Edit: output of dput of the first 25 observations of my data:

structure(list(V1 = c("H", "L", "S", "S", "S", "S", "S", "S", 
"S", "S", "H", "L", "S", "S", "S", "S", "S", "S", "S", "S", "H", 
"L", "S", "S", "S"), V2 = c(723L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 723L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 723L, 1L, 2L, 
3L, 4L), V3 = c("O2:CH\\A20\\7250.0", "1574721653.7385", "1574721662.9515", 
"1574729100.1103", "1574729880.0359", "1574730000.0388", "1574757000.0533", 
"1574757240.0648", "1574757300.0335", "1574757360.039", "O2:CH\\A20\\7300.0", 
"1574721653.7385", "1574721662.9515", "1574729100.1103", "1574729880.0359", 
"1574730000.0388", "1574757000.0533", "1574757240.0648", "1574757300.0335", 
"1574757360.039", "O2:CH\\A20\\7350.0", "1574721653.7385", "1574721662.9515", 
"1574729100.1103", "1574729880.0359"), V4 = c("USD", "1", "0", 
"2", "3", "5", "9", "3", "12", "13", "USD", "1", "0", "2", "3", 
"5", "9", "3", "12", "13", "USD", "1", "0", "2", "3"), V5 = c("6", 
"", "", "", "", "", "", "", "", "", "6", "", "", "", "", "", 
"", "", "", "", "6", "", "", "", "")), row.names = c(NA, 25L), class = "data.frame")

lag can only accept a scalar value but here's a little trick that we can use:

library(dplyr)

df %>%
  mutate(inds = row_number() - lag, 
         c = a[replace(inds, inds < 1, NA)])

#    a lag inds  c
#1   1   3   -2 NA
#2   2   3   -1 NA
#3   3   3    0 NA
#4   4   2    2  2
#5   5   3    2  2
#6   6   2    4  4
#7   7   2    5  5
#8   8   2    6  6
#9   9   3    6  6
#10 10   1    9  9

c is the final output we are looking at. Basically, we subtract the current row number with the lag value to get index which is used to subset a values.

data

set.seed(123)
df <- data.frame(a = 1:10, lag = sample(3, 10, replace = TRUE))
df
#    a lag
#1   1   3
#2   2   3
#3   3   3
#4   4   2
#5   5   3
#6   6   2
#7   7   2
#8   8   2
#9   9   3
#10 10   1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM