
SQL query to count rows based on previous values of different column

I'm working in SAS and I have a table that looks like this

ID | Time | Main | lag_1 | lag_2
----------------------------------------------------------------------------
A  |  01  |   0  |   0   |  1  
A  |  03  |   0  |   0   |  1  
A  |  04  |   0  |   0   |  0  
A  |  10  |   1  |   0   |  0  
A  |  11  |   1  |   0   |  0  
A  |  12  |   1  |   0   |  0  
B  |  02  |   1  |   1   |  1  
B  |  04  |   0  |   1   |  1  
B  |  07  |   0  |   0   |  1  
B  |  10  |   1  |   0   |  0  
B  |  11  |   1  |   0   |  0  
B  |  12  |   1  |   0   |  0  

except with many more IDs. The table is sorted by ID and Time. After calculating the total count of ones in the Main column (call it tot), I am trying to calculate two things:

  1. The total count of ones in the Main column, counting a one only if lag_1 has been equal to 1 at some earlier time for the same ID (call it tot_1); and
  2. The same as 1, but for lag_2 (call it tot_2).

The expected result would be:

tot | tot_1 | tot_2
--------------------
 7  |   3   |   6

since tot_1 should be 3 (0 from ID = A + 3 from ID = B), and tot_2 should be 6 (3 from ID = A + 3 from ID = B).

I am a complete beginner in these types of segmentations so any help is greatly appreciated.

Edit: I would expect that tot_2 >= tot_1 because lag_2 is built on events from Main which go longer back in time than lag_1 does.

Much easier to do in a data step. That way you can check for start of new id and reset the flag for whether the lag_x variables were ever true.

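data want ;
  set have end=eof;
  by id time ;
  /* running total of Main over all rows */
  tot + main ;
  /* new id: reset the "lag seen so far" flags */
  if first.id then call missing(any_lag_1,any_lag_2);
  /* count Main only if the lag was 1 on an earlier row of this id */
  if any_lag_1 then tot_1 + main ;
  if any_lag_2 then tot_2 + main ;
  /* write the single summary row after the last observation */
  if eof then output;
  /* update the flags after the checks, so the current row never counts as "before" */
  any_lag_1+lag_1;
  any_lag_2+lag_2;
  keep tot: ;
run;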
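
For readers who do not use SAS, the same "check the flag, then update it" logic can be sketched in Python. This is only an illustration of the idea behind the data step above, not a replacement for it; the names ROWS, running_totals, seen_lag_1 and seen_lag_2 are made up for this sketch, and the rows are the sample data from the question.

```python
# Sample rows from the question: (id, time, main, lag_1, lag_2),
# already sorted by id and time.
ROWS = [
    ("A", 1, 0, 0, 1),
    ("A", 3, 0, 0, 1),
    ("A", 4, 0, 0, 0),
    ("A", 10, 1, 0, 0),
    ("A", 11, 1, 0, 0),
    ("A", 12, 1, 0, 0),
    ("B", 2, 1, 1, 1),
    ("B", 4, 0, 1, 1),
    ("B", 7, 0, 0, 1),
    ("B", 10, 1, 0, 0),
    ("B", 11, 1, 0, 0),
    ("B", 12, 1, 0, 0),
]


def running_totals(rows):
    """Return (tot, tot_1, tot_2) using per-id 'seen so far' flags."""
    tot = tot_1 = tot_2 = 0
    current_id = None
    seen_lag_1 = seen_lag_2 = False
    for row_id, _time, main, lag_1, lag_2 in rows:
        # Reset the flags whenever a new id starts.
        if row_id != current_id:
            current_id = row_id
            seen_lag_1 = seen_lag_2 = False
        tot += main
        # Count main only if the lag was 1 on an earlier row of this id.
        if seen_lag_1:
            tot_1 += main
        if seen_lag_2:
            tot_2 += main
        # Update the flags last, so the current row is not its own "before".
        seen_lag_1 = seen_lag_1 or lag_1 == 1
        seen_lag_2 = seen_lag_2 or lag_2 == 1
    return tot, tot_1, tot_2


print(running_totals(ROWS))  # (7, 3, 6)
```

Updating the flags after the checks is what excludes B's first row (time 02), where main and lag_1 are both 1 on the same row.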

If I understand correctly, you want these sums per id. The key is comparing, for each id, the earliest time at which each column equals 1, and then doing the sums. This is all conditional aggregation:

select sum(tot) as tot,
       sum(case when time_lag_1 < time_main then tot else 0 end) as tot_1,
       sum(case when time_lag_2 < time_main then tot else 0 end) as tot_2
from (select id, sum(main) as tot,
             min(case when main = 1 then time end) as time_main,
             min(case when lag_1 = 1 then time end) as time_lag_1,
             min(case when lag_2 = 1 then time end) as time_lag_2
      from t 
      group by id
     ) t;

Note that this applies the test once per id rather than per row. For ID = B the first main = 1 and the first lag_1 = 1 both occur at time 02, so none of B's ones are counted, which differs from the row-level totals expected in the question.

Consider the computation for tot_1 and tot_2.

My first step is to flag rows where lag_1 > main, that is, rows where lag_1 = 1 while main = 0 (this covers the case you mentioned: lag_1 = 1 at some time before main = 1). I label such rows 'grp_lag_1', and likewise 'grp_lag_2' for lag_2.

Once the rows are labelled, I "copy" the label down to all later rows of the same id using max() over(partition by id order by id,time1).

select *
      ,max(case when lag_1 > main then 'grp_lag_1' end) over(partition by id order by id,time1) as grp_1 
      ,max(case when lag_2 > main then 'grp_lag_2' end) over(partition by id order by id,time1) as grp_2 
  from t

This gives the following result:

+----+-------+------+-------+-------+-----------+-----------+
| id | time1 | main | lag_1 | lag_2 |   grp_1   |   grp_2   |
+----+-------+------+-------+-------+-----------+-----------+
| A  |    01 |    0 |     0 |     1 |           | grp_lag_2 |
| A  |    03 |    0 |     0 |     1 |           | grp_lag_2 |
| A  |    04 |    0 |     0 |     0 |           | grp_lag_2 |
| A  |    10 |    1 |     0 |     0 |           | grp_lag_2 |
| A  |    11 |    1 |     0 |     0 |           | grp_lag_2 |
| A  |    12 |    1 |     0 |     0 |           | grp_lag_2 |
| B  |    02 |    1 |     1 |     1 |           |           |
| B  |    04 |    0 |     1 |     1 | grp_lag_1 | grp_lag_2 |
| B  |    07 |    0 |     0 |     1 | grp_lag_1 | grp_lag_2 |
| B  |    10 |    1 |     0 |     0 | grp_lag_1 | grp_lag_2 |
| B  |    11 |    1 |     0 |     0 | grp_lag_1 | grp_lag_2 |
| B  |    12 |    1 |     0 |     0 | grp_lag_1 | grp_lag_2 |
+----+-------+------+-------+-------+-----------+-----------+

After this, summing the main values where grp_1 = 'grp_lag_1' gives tot_1, and likewise summing where grp_2 = 'grp_lag_2' gives tot_2:

select sum(main) as tot_cnt
      ,sum(case when grp_1='grp_lag_1' then main end) as tot_1
      ,sum(case when grp_2='grp_lag_2' then main end) as tot_2
  from (
        select *
              ,max(case when lag_1 > main then 'grp_lag_1' end) over(partition by id order by id,time1) as grp_1
              ,max(case when lag_2 > main then 'grp_lag_2' end) over(partition by id order by id,time1) as grp_2
          from t
       ) x


+---------+-------+-------+
| tot_cnt | tot_1 | tot_2 |
+---------+-------+-------+
|       7 |     3 |     6 |
+---------+-------+-------+

Demo https://dbfiddle.uk/?rdbms=sqlserver_2012&fiddle=c17be111dbc3c516afa2bc3dcd3c9e1c
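
The window-function query above was written for SQL Server. For engines without window functions (the question mentions SAS, where PROC SQL may not accept OVER clauses), a correlated subquery is one possible alternative. The sketch below is an assumption-laden illustration, not part of the original answer: it reads "before" strictly as "an earlier time1 for the same id", reuses the table name t and column time1 from the query above, and runs the SQL through Python's built-in sqlite3 module only so that it can be executed as is. The names ROWS, QUERY and totals_via_sql are hypothetical.

```python
import sqlite3

# Sample rows from the question: (id, time1, main, lag_1, lag_2).
ROWS = [
    ("A", 1, 0, 0, 1), ("A", 3, 0, 0, 1), ("A", 4, 0, 0, 0),
    ("A", 10, 1, 0, 0), ("A", 11, 1, 0, 0), ("A", 12, 1, 0, 0),
    ("B", 2, 1, 1, 1), ("B", 4, 0, 1, 1), ("B", 7, 0, 0, 1),
    ("B", 10, 1, 0, 0), ("B", 11, 1, 0, 0), ("B", 12, 1, 0, 0),
]

# For each row, flag whether an earlier row of the same id had lag_x = 1,
# then sum main over the flagged rows.
QUERY = """
SELECT SUM(main)             AS tot,
       SUM(main * had_lag_1) AS tot_1,
       SUM(main * had_lag_2) AS tot_2
FROM (
    SELECT cur.main,
           CASE WHEN EXISTS (SELECT 1 FROM t AS prev
                             WHERE prev.id = cur.id
                               AND prev.time1 < cur.time1
                               AND prev.lag_1 = 1)
                THEN 1 ELSE 0 END AS had_lag_1,
           CASE WHEN EXISTS (SELECT 1 FROM t AS prev
                             WHERE prev.id = cur.id
                               AND prev.time1 < cur.time1
                               AND prev.lag_2 = 1)
                THEN 1 ELSE 0 END AS had_lag_2
    FROM t AS cur
) AS flagged
"""


def totals_via_sql(rows):
    """Load rows into an in-memory table t and return (tot, tot_1, tot_2)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (id TEXT, time1 INTEGER, main INTEGER, "
        "lag_1 INTEGER, lag_2 INTEGER)"
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchone()
    conn.close()
    return result


print(totals_via_sql(ROWS))  # (7, 3, 6)
```

Unlike the lag_1 > main test, this version does not depend on main being 0 on the row where the lag column is 1; it only asks whether some earlier row of the same id had the lag column set.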
