I am working with time series data and I need to count the number of rows preceding the current row that matched a condition. For example, I need to know for how many months prior to the row's month the customer had sales (NETSALES > 0). Ideally I would maintain a row counter that resets when the condition fails (e.g. NETSALES = 0).
Another way of solving the problem would be to flag any row that had more than 12 prior periods of NETSALES.
The closest I came was using a windowed count:
COUNT(*)
OVER (PARTITION BY cust ORDER BY dt
ROWS 12 PRECEDING) as CtWindow,
http://sqlfiddle.com/#!6/990eb/2
In the example above, 201310 is correctly flagged as 12 but ideally the previous row would have been 11.
The solution can be in R or T-SQL.
Update with a data.table example:
library(data.table)
set.seed(50)
DT <- data.table(NETSALES=ifelse(runif(40)<.15,0,runif(40,1,100)), cust=rep(1:2, each=20), dt=1:20)
The goal is to calculate a "run" column like the one below, which resets to zero whenever NETSALES is zero:
NETSALES cust dt run
1: 36.956464 1 1 1
2: 83.767621 1 2 2
3: 28.585003 1 3 3
4: 10.250524 1 4 4
5: 6.537188 1 5 5
6: 0.000000 1 6 0
7: 95.489944 1 7 1
8: 46.351387 1 8 2
9: 0.000000 1 9 0
10: 0.000000 1 10 0
11: 99.621881 1 11 1
12: 76.755104 1 12 2
13: 64.288721 1 13 3
14: 0.000000 1 14 0
15: 36.504473 1 15 1
16: 43.157142 1 16 2
17: 71.808349 1 17 3
18: 53.039105 1 18 4
19: 0.000000 1 19 0
20: 27.387369 1 20 1
21: 58.308899 2 1 1
22: 65.929296 2 2 2
23: 20.529473 2 3 3
24: 58.970898 2 4 4
25: 13.785201 2 5 5
26: 4.796752 2 6 6
27: 72.758112 2 7 7
28: 7.088647 2 8 8
29: 14.516362 2 9 9
30: 94.470714 2 10 10
31: 51.254178 2 11 11
32: 99.544261 2 12 12
33: 66.475412 2 13 13
34: 8.362936 2 14 14
35: 96.742115 2 15 15
36: 15.677712 2 16 16
37: 0.000000 2 17 0
38: 95.684652 2 18 1
39: 65.639292 2 19 2
40: 95.721081 2 20 3
NETSALES cust dt run
This seems to do it:
library(data.table)
set.seed(50)
DT <- data.table(NETSALES=ifelse(runif(40)<.15,0,runif(40,1,100)), cust=rep(1:2, each=20), dt=1:20)
DT[,dir:=ifelse(NETSALES>0,1,0)]
dir.rle <- rle(DT$dir)
DT <- transform(DT, indexer = rep(1:length(dir.rle$lengths), dir.rle$lengths))
DT[,runl:=cumsum(dir),by=indexer]
Credit to the question "Cumulative sums over run lengths. Can this loop be vectorized?"
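To see what the run-length indexer is doing, here is a small base-R sketch on a made-up vector. The values and the names `sales`, `dir_rle` and `run` are illustrative assumptions, not the question's data, and `ave()` stands in for the data.table grouping used above.

```r
# Toy sales vector (made-up values, not the question's data)
sales <- c(5, 3, 0, 2, 0, 0, 1, 1)

# 1 where there was a sale, 0 otherwise
dir <- as.integer(sales > 0)

# rle() compresses the vector into runs of equal values
dir_rle <- rle(dir)

# Label every element with the index of the run it belongs to
indexer <- rep(seq_along(dir_rle$lengths), dir_rle$lengths)

# Cumulative sum within each run: counts up inside a run of sales,
# stays at 0 inside a run of zeros
run <- ave(dir, indexer, FUN = cumsum)

print(data.frame(sales, dir, indexer, run))
```

Because the cumulative sum restarts at every new run label, the counter resets as soon as a zero appears. Note that this toy version has no customer column, so it does not handle the boundary between customers.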
Edit by Roland:
Here is the same with better performance and also considering different customers:
#no need for ifelse
DT[,dir:= NETSALES>0]
#use a function to avoid storing the rle, which could be huge
runseq <- function(x) {
x.rle <- rle(x)
rep(1:length(x.rle$lengths), x.rle$lengths)
}
#never use transform with data.table
DT[,indexer := runseq(dir)]
#include cust in by
DT[,runl:=cumsum(dir),by=list(indexer,cust)]
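As a possible shortcut, more recent data.table releases provide `rleid()`, which builds the run index directly. The sketch below is one way the steps above might be collapsed into a single grouped assignment. The toy data, the column names `run` and `long_run`, and the `threshold` variable are assumptions for illustration only, and the sketch assumes the rows are sorted by customer and period.

```r
library(data.table)

# Toy data (made-up values): two customers, five periods each
DT <- data.table(
  cust     = rep(1:2, each = 5),
  dt       = rep(1:5, times = 2),
  NETSALES = c(10, 20, 0, 30, 40, 5, 0, 0, 7, 8)
)

# Runs only make sense in time order within each customer
setorder(DT, cust, dt)

# rleid() labels consecutive runs of equal values; grouping by cust as
# well keeps a run from spilling over from one customer to the next
DT[, run := cumsum(NETSALES > 0), by = .(cust, rleid(NETSALES > 0))]

# Hypothetical flag for long runs. Assumption: the count includes the
# current row, so adjust the threshold if only prior rows should count
threshold <- 12L
DT[, long_run := run > threshold]

print(DT)
```

In this toy data customer 1 ends on a sale and customer 2 starts on one, so the two rows share a run label. Including `cust` in `by` is what makes the counter restart at 1 for customer 2.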
Edit: joe added a SQL solution: http://sqlfiddle.com/#!6/990eb/22
The SQL solution took 48 minutes on a machine with 128 GB of RAM across 22 million rows. The R solution took about 20 seconds on a workstation with 4 GB of RAM. Go R!