简体   繁体   中英

Nested window functions in Google BigQuery

I want to count unique IDs over all timestamps in the past per timestamp in case the last value of the ID is greater than 0 in a google BigQuery SQL. I don't want to GROUP BY cause I need the whole table as output. Also the table has > 1 billion rows so the query should be efficient.

Imagine I have a table like this:

| ID | value | timestamp  |
|:-- | ----- | ----------:|
| A  | 1     | 2021-01-01 |
| B  | 0     | 2021-01-01 |
| C  | 0     | 2021-01-01 |
| A  | 0     | 2021-01-02 |
| B  | 1     | 2021-01-02 |
| C  | 1     | 2021-01-03 |
| B  | 0     | 2021-01-04 |

the result should look like this:

| ID | value | timestamp  | count_val_gt_0 |
|:-- | ----- | ---------- | --------------:|
| A  | 1     | 2021-01-01 | 1              |
| B  | 0     | 2021-01-01 | 1              |
| C  | 0     | 2021-01-01 | 1              |
| A  | 0     | 2021-01-02 | 1              |
| B  | 1     | 2021-01-02 | 1              |
| C  | 1     | 2021-01-03 | 2              |
| B  | 0     | 2021-01-04 | 1              |

explanation:

timestamp  - set of unique IDs with last value > 0

2021-01-01: {A}
2021-02-01: {B}
2021-03-01: {B,C}
2021-04-01: {C}

For timestamp 2021-01-01 only A has a value greater 0. No timestamp before that. For all rows with timestamp 2021-01-02 I'm counting unique IDs in case last value of this ID is greater than 0 over the timestamps 2021-01-01 and 2021-01-02. The last value of A is no longer greater than 0 but now B is. For timestamp 2021-01-03 last value of B is still greater 0, now also last value of C, so I'm counting 2. For timestamp 2021-01-04 B is no longer greater 0, so its just C: 1.

What I tried was following this approach (in "Nested value_of expression at row function") like so:

I added a next_timestamp field, that displays the next occurrence of an ID and tried:

SELECT 
  id
, timestamp
, COUNT(DISTINCT CASE WHEN value > 0 AND NOT next_timestamp <= VALUE OF timestamp AT CURRENT_ROW THEN id END) OVER (PARTITION BY timestamp RANGE UNBOUNDED PRECEDING) as count_id_gt_0
FROM my_table

but in google BigQuery VALUE OF is not recognized: Syntax error: Unexpected keyword OF

Here a query to work with:

WITH data AS (
  SELECT * FROM UNNEST([
    STRUCT
    ('A' as id,1 as value, 1 as time_stamp), 
    ('B', 0, 1),
    ('C', 0, 1),
    ('A', 0, 2),
    ('B', 1, 2),
    ('C', 1, 3),
    ('B', 0, 4)
  ])
),
final_table AS (
  SELECT
    id
  , value
  , time_stamp
  , LEAD(time_stamp,1) OVER (PARTITION BY id ORDER BY time_stamp) AS next_time
  FROM data
)
  SELECT 
    id
  , value
  , time_stamp
  , next_time
  , COUNT( CASE WHEN value > 0 AND NOT next_time <= ft.time_stamp THEN id END) OVER(
      ORDER BY time_stamp 
      RANGE UNBOUNDED PRECEDING
    ) AS id_gt_0_array
  FROM final_table ft 

the result is still not as expected as the next_time <= ft.time_stamp is ignored:

| id | value | timestamp  | id_gt_0        |
|:-- | ----- | ---------- | --------------:|
| A  | 1     | 2021-01-01 | 1              |
| B  | 0     | 2021-01-01 | 1              |
| C  | 0     | 2021-01-01 | 1              |
| A  | 0     | 2021-01-02 | 1              |
| B  | 1     | 2021-01-02 | 2              |
| C  | 1     | 2021-01-03 | 2              |
| B  | 0     | 2021-01-04 | 2              |

Update with solution:

Based on the suggestion of @Mikhail Berlyant I got the right result which is also very fast with this query:

select * except(new_value), 
  sum(new_value) over win as unique_ids
from (
  select *, 
    if(not lag(value) over by_id is null,
      if(lag(value) over by_id > 0,
        if(value > 0, 0, -1),
        if(value > 0, 1, 0)), 
      if(value > 0,1,0)
    ) new_value
  from final_table
  window by_id as (partition by id order by time_stamp)
)
window win as (order by time_stamp range between unbounded preceding and current row) 

Thanks!

Consider below approach

select * except(new_value), 
  sum(new_value) over win as unique_ids
from (
  select *, 
    if(not lag(value) over by_id is null,
      if(lag(value) over by_id > 0, if(value = 0, -1, 0), 1), 
      value
    ) new_value
  from your_table
  window by_id as (partition by id order by timestamp)
)
window win as (order by timestamp range between unbounded preceding and current row)       

with output

在此处输入图像描述

Please note:

  1. above is not tested and was written just as an example for alternative solution to address ">1 billion issue"
  2. while not fully tested - i did very quick one and looks like it works as expected and at least for dummy example in your question output is correct
  3. for small data, already proposed solution by Jaytiger is more effective. but for really big/heavy cases like yours - I think this approach has good chances to be more effective

Hope this is helpful. This query might not be scalable due to cumulative ARRAY_AGG ing.

WITH data AS (
  SELECT * FROM UNNEST([
    STRUCT
    ('A' as id,1 as value, 1 as time_stamp), 
    ('B', 0, 1),
    ('C', 0, 1),
    ('A', 0, 2),
    ('B', 1, 2),
    ('C', 1, 3),
    ('B', 0, 4)
  ])
),
array_agg AS (
  SELECT *, ARRAY_AGG(d) OVER (ORDER BY time_stamp) arr FROM data d
)
SELECT * EXCEPT(arr), 
       (SELECT COUNTIF(latest_value_by_id > 0) FROM (
          SELECT ARRAY_AGG(i.value ORDER BY i.time_stamp DESC LIMIT 1)[SAFE_OFFSET(0)] latest_value_by_id
            FROM t.arr i GROUP BY i.id
       )) AS id_gt_0
  FROM array_agg t;

+----+-------+------------+---------+
| id | value | time_stamp | id_gt_0 |
+----+-------+------------+---------+
| A  |     1 |          1 |       1 |
| B  |     0 |          1 |       1 |
| C  |     0 |          1 |       1 |
| A  |     0 |          2 |       1 |
| B  |     1 |          2 |       1 |
| C  |     1 |          3 |       2 |
| B  |     0 |          4 |       1 |
+----+-------+------------+---------+

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM