
Count preceding rows that match criteria

I am working with time series data and I need to count the number of rows preceding the current row that matched a condition. For example, I need to know how many months prior to the row's month the customer had sales (NETSALES > 0). Ideally I would maintain a row counter that resets when the condition fails (e.g. NETSALES = 0).

Another way of solving the problem would be to flag any row that had more than 12 prior periods of NETSALES.
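As a rough base R sketch of that flagging idea: the function name `flag_long_history` and the toy series are made up for illustration, and it assumes that "prior periods" means consecutive periods with NETSALES > 0 immediately before the row, for a single customer sorted by period and with no missing values.

```r
# Flag rows preceded by more than `min_prior` consecutive periods with sales.
# x: sales for ONE customer, sorted by period, no missing values (assumption).
flag_long_history <- function(x, min_prior = 12) {
  pos <- x > 0
  # streak length up to and including each row;
  # every non-positive value starts a new group, so the count resets there
  run <- ave(as.integer(pos), cumsum(!pos), FUN = cumsum)
  # shift by one row so that only periods BEFORE the current row are counted
  prior <- c(0L, run)[seq_along(run)]
  prior > min_prior
}

# Toy series (illustrative only): 14 months with sales, one without, then 2 with
sales <- c(rep(50, 14), 0, 50, 50)
flags <- flag_long_history(sales)
which(flags)
```

Note that the month without sales is still flagged here, because the flag only looks at the periods before the row. If the current row should also have sales, combining the result with `x > 0` would be one way to do it.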

The closest I came was using a windowed count:

COUNT(*) 
OVER (PARTITION BY cust ORDER BY dt
  ROWS 12 PRECEDING) as CtWindow,

http://sqlfiddle.com/#!6/990eb/2

In the example above, 201310 is correctly flagged as 12, but ideally the previous row would have been 11.

The solution can be in R or T-SQL.

Updated with a data.table example:

library(data.table)
set.seed(50)
DT <- data.table(NETSALES=ifelse(runif(40)<.15,0,runif(40,1,100)), cust=rep(1:2, each=20), dt=1:20)

The goal is to calculate a "run" column like the one below, which gets reset to zero when the value is zero:

     NETSALES cust dt run
 1: 36.956464    1  1   1
 2: 83.767621    1  2   2
 3: 28.585003    1  3   3
 4: 10.250524    1  4   4
 5:  6.537188    1  5   5
 6:  0.000000    1  6   0
 7: 95.489944    1  7   1
 8: 46.351387    1  8   2
 9:  0.000000    1  9   0
10:  0.000000    1 10   0
11: 99.621881    1 11   1
12: 76.755104    1 12   2
13: 64.288721    1 13   3
14:  0.000000    1 14   0
15: 36.504473    1 15   1
16: 43.157142    1 16   2
17: 71.808349    1 17   3
18: 53.039105    1 18   4
19:  0.000000    1 19   0
20: 27.387369    1 20   1
21: 58.308899    2  1   1
22: 65.929296    2  2   2
23: 20.529473    2  3   3
24: 58.970898    2  4   4
25: 13.785201    2  5   5
26:  4.796752    2  6   6
27: 72.758112    2  7   7
28:  7.088647    2  8   8
29: 14.516362    2  9   9
30: 94.470714    2 10  10
31: 51.254178    2 11  11
32: 99.544261    2 12  12
33: 66.475412    2 13  13
34:  8.362936    2 14  14
35: 96.742115    2 15  15
36: 15.677712    2 16  16
37:  0.000000    2 17   0
38: 95.684652    2 18   1
39: 65.639292    2 19   2
40: 95.721081    2 20   3
     NETSALES cust dt run
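To pin down the rule behind the "run" column, here is a minimal reference sketch as a plain loop in base R. The name `run_reference` and the toy data are made up for illustration; the numbers are not the output of the set.seed(50) example. It assumes rows are already sorted by period within each customer and that NETSALES has no missing values.

```r
# Reference implementation of the "run" column as a plain loop.
# x: sales for ONE customer, already sorted by period (assumption).
run_reference <- function(x) {
  run <- integer(length(x))
  streak <- 0L
  for (i in seq_along(x)) {
    # extend the streak on a sale, otherwise reset it to zero
    streak <- if (x[i] > 0) streak + 1L else 0L
    run[i] <- streak
  }
  run
}

# Toy data, made up for illustration: two customers, five periods each
toy <- data.frame(
  cust     = rep(1:2, each = 5),
  NETSALES = c(10, 20, 0, 5, 7,   3, 0, 0, 8, 9)
)

# ave() applies the function separately within each customer,
# so the counter starts from zero again for customer 2
toy$run <- ave(toy$NETSALES, toy$cust, FUN = run_reference)
toy
```

A loop like this will be slow on millions of rows, but it is handy as a check against a faster vectorized version.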

This seems to do it:

library(data.table)
set.seed(50)
DT <- data.table(NETSALES=ifelse(runif(40)<.15,0,runif(40,1,100)), cust=rep(1:2, each=20), dt=1:20)
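#1 where there was a sale, 0 otherwise
DT[,dir:=ifelse(NETSALES>0,1,0)]
#run-length encode the indicator
dir.rle <- rle(DT$dir)
#label every row with the index of the run it belongs to
DT <- transform(DT, indexer = rep(1:length(dir.rle$lengths), dir.rle$lengths))
#count within each run: 1, 2, 3, ... for sales, 0 for runs of zeros
DT[,runl:=cumsum(dir),by=indexer]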

Credit to the question "Cumulative sums over run lengths. Can this loop be vectorized?"


Edit by Roland:

Here is the same approach with better performance, which also takes the different customers into account:

#no need for ifelse
DT[,dir:= NETSALES>0]

#use a function to avoid storing the rle, which could be huge
runseq <- function(x) {
  x.rle <- rle(x)
  rep(1:length(x.rle$lengths), x.rle$lengths)
}

#never use transform with data.table
DT[,indexer := runseq(dir)]

#include cust in by
DT[,runl:=cumsum(dir),by=list(indexer,cust)]
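For what it's worth, newer versions of data.table ship an `rleid()` function (added in 1.9.6, as far as I know) that can stand in for the `runseq()` helper. The following is only a sketch on made-up toy data, not part of the original answer, and it assumes the table is already sorted by cust and dt.

```r
library(data.table)

# Toy data, made up for illustration: two customers, five periods each.
# Customer 1 ends with sales and customer 2 starts with sales on purpose.
DT2 <- data.table(
  cust     = rep(1:2, each = 5),
  dt       = rep(1:5, times = 2),
  NETSALES = c(10, 20, 0, 5, 7,   3, 0, 0, 8, 9)
)

# rleid() numbers consecutive runs of identical values;
# grouping by cust as well keeps a run from spilling into the next customer
DT2[, run := cumsum(NETSALES > 0), by = .(cust, rleid(NETSALES > 0))]
DT2
```

Without `cust` in `by`, row 6 would continue the streak from the end of customer 1 instead of starting at 1, which is the same issue the `by=list(indexer,cust)` above addresses.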

Edit: joe added a SQL solution: http://sqlfiddle.com/#!6/990eb/22

The SQL solution takes 48 minutes across 22 million rows on a machine with 128 GB of RAM. The R solution takes about 20 seconds on a workstation with 4 GB of RAM. Go R!
