Nested window function in Google BigQuery

Question

For each timestamp, I want to count the unique IDs whose most recent value, looking at that timestamp and all earlier ones, is greater than 0, in Google BigQuery SQL. I don't want to use GROUP BY because I need the whole table as output. The table also has more than 1 billion rows, so the query should be efficient.

Imagine I have a table like this:

| ID | value | timestamp  |
|:-- | ----- | ----------:|
| A  | 1     | 2021-01-01 |
| B  | 0     | 2021-01-01 |
| C  | 0     | 2021-01-01 |
| A  | 0     | 2021-01-02 |
| B  | 1     | 2021-01-02 |
| C  | 1     | 2021-01-03 |
| B  | 0     | 2021-01-04 |

The result should look like this:

| ID | value | timestamp  | count_val_gt_0 |
|:-- | ----- | ---------- | --------------:|
| A  | 1     | 2021-01-01 | 1              |
| B  | 0     | 2021-01-01 | 1              |
| C  | 0     | 2021-01-01 | 1              |
| A  | 0     | 2021-01-02 | 1              |
| B  | 1     | 2021-01-02 | 1              |
| C  | 1     | 2021-01-03 | 2              |
| B  | 0     | 2021-01-04 | 1              |

Explanation:

timestamp  - set of unique IDs with last value > 0

2021-01-01: {A}
2021-01-02: {B}
2021-01-03: {B,C}
2021-01-04: {C}

For timestamp 2021-01-01, only A has a value greater than 0, and there is no earlier timestamp. For all rows with timestamp 2021-01-02, I count the unique IDs whose last value across timestamps 2021-01-01 and 2021-01-02 is greater than 0: the last value of A is no longer greater than 0, but the last value of B now is. For timestamp 2021-01-03, the last value of B is still greater than 0 and so is the last value of C, so the count is 2. For timestamp 2021-01-04, B is no longer greater than 0, so only C remains and the count is 1.
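To pin down the definition above, here is a minimal brute-force sketch in Python (not BigQuery SQL, and far too slow for a billion rows). The names `ROWS` and `count_val_gt_0` are made up for this illustration. It assumes at most one row per ID and timestamp, and that ISO-formatted date strings can be compared as text.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, str]  # (id, value, timestamp as ISO date string)

# Sample table from the question.
ROWS: List[Row] = [
    ("A", 1, "2021-01-01"),
    ("B", 0, "2021-01-01"),
    ("C", 0, "2021-01-01"),
    ("A", 0, "2021-01-02"),
    ("B", 1, "2021-01-02"),
    ("C", 1, "2021-01-03"),
    ("B", 0, "2021-01-04"),
]


def count_val_gt_0(rows: List[Row]) -> List[int]:
    """For each row, count the IDs whose latest value at or before
    that row's timestamp is greater than 0."""
    result = []
    for _, _, ts in rows:
        # id -> (timestamp, value) of the most recent row at or before ts
        latest: Dict[str, Tuple[str, int]] = {}
        for rid, value, rts in rows:
            if rts <= ts and (rid not in latest or rts >= latest[rid][0]):
                latest[rid] = (rts, value)
        result.append(sum(1 for _, v in latest.values() if v > 0))
    return result


counts = count_val_gt_0(ROWS)
print(counts)
```

The list it prints follows the row order of the sample table, so it can be compared line by line with the `count_val_gt_0` column.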

What I tried was to follow this approach (in "Nested value_of expression at row function"), like so:

I added a next_timestamp field that holds the timestamp of the next occurrence of an ID, and tried:

SELECT 
  id
, timestamp
, COUNT(DISTINCT CASE WHEN value > 0 AND NOT next_timestamp <= VALUE OF timestamp AT CURRENT_ROW THEN id END) OVER (PARTITION BY timestamp RANGE UNBOUNDED PRECEDING) as count_id_gt_0
FROM my_table

But Google BigQuery does not recognize VALUE OF: Syntax error: Unexpected keyword OF

Here is a query to work with:

WITH data AS (
  SELECT * FROM UNNEST([
    STRUCT
    ('A' as id,1 as value, 1 as time_stamp), 
    ('B', 0, 1),
    ('C', 0, 1),
    ('A', 0, 2),
    ('B', 1, 2),
    ('C', 1, 3),
    ('B', 0, 4)
  ])
),
final_table AS (
  SELECT
    id
  , value
  , time_stamp
  , LEAD(time_stamp,1) OVER (PARTITION BY id ORDER BY time_stamp) AS next_time
  FROM data
)
  SELECT 
    id
  , value
  , time_stamp
  , next_time
  , COUNT( CASE WHEN value > 0 AND NOT next_time <= ft.time_stamp THEN id END) OVER(
      ORDER BY time_stamp 
      RANGE UNBOUNDED PRECEDING
    ) AS id_gt_0_array
  FROM final_table ft

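The result is still not as expected, because the condition next_time <= ft.time_stamp has no useful effect. Inside the window aggregate, ft.time_stamp is the timestamp of each row being aggregated, not of the current row, so next_time is never compared against the current row's timestamp: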

| id | value | timestamp  | id_gt_0        |
|:-- | ----- | ---------- | --------------:|
| A  | 1     | 2021-01-01 | 1              |
| B  | 0     | 2021-01-01 | 1              |
| C  | 0     | 2021-01-01 | 1              |
| A  | 0     | 2021-01-02 | 1              |
| B  | 1     | 2021-01-02 | 2              |
| C  | 1     | 2021-01-03 | 2              |
| B  | 0     | 2021-01-04 | 2              |

Update with solution:

Based on the suggestion of @Mikhail Berlyant, I got the right result with this query, which is also very fast:

select * except(new_value), 
  sum(new_value) over win as unique_ids
from (
  select *, 
    if(not lag(value) over by_id is null,
      if(lag(value) over by_id > 0,
        if(value > 0, 0, -1),
        if(value > 0, 1, 0)), 
      if(value > 0,1,0)
    ) new_value
  from final_table
  window by_id as (partition by id order by time_stamp)
)
window win as (order by time_stamp range between unbounded preceding and current row)

Thanks!

Answer 1

Consider the approach below:

select * except(new_value), 
  sum(new_value) over win as unique_ids
from (
  select *, 
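    -- delta per row: +1 when the ID turns positive, -1 when it stops being positive, 0 otherwise
    if(not lag(value) over by_id is null,
      if(lag(value) over by_id > 0, if(value > 0, 0, -1), if(value > 0, 1, 0)), 
      if(value > 0, 1, 0)  -- first row of this ID
    ) new_value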
  from your_table
  window by_id as (partition by id order by timestamp)
)
window win as (order by timestamp range between unbounded preceding and current row)


Please note:

The above was written just as an example of an alternative solution to address the ">1 billion rows" issue.
It is not fully tested: I ran one very quick check, it looks like it works as expected, and at least for the dummy example in your question the output is correct.
For small data, the solution already proposed by Jaytiger is more efficient. For really big or heavy cases like yours, I think this approach has a good chance of being more efficient.
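The idea behind this query can be sketched outside SQL. Each row becomes a delta: +1 when its ID turns positive, -1 when it stops being positive, and 0 otherwise. A running sum of those deltas over the timestamps then gives the count. The Python below only illustrates that logic and is not the BigQuery implementation. The function name `running_positive_ids` is hypothetical, and it assumes at most one row per ID and timestamp.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int, int]  # (id, value, time_stamp)

# Sample data from the question's test query.
ROWS: List[Row] = [
    ("A", 1, 1), ("B", 0, 1), ("C", 0, 1),
    ("A", 0, 2), ("B", 1, 2),
    ("C", 1, 3),
    ("B", 0, 4),
]


def running_positive_ids(rows: List[Row]) -> List[int]:
    # Step 1 (inner query): per ID, in timestamp order, compare each value
    # with the previous value of the same ID (the LAG) and emit a delta.
    prev_positive: Dict[str, bool] = {}
    delta_per_ts: Dict[int, int] = defaultdict(int)
    for rid, value, ts in sorted(rows, key=lambda r: r[2]):
        now_positive = value > 0
        delta_per_ts[ts] += int(now_positive) - int(prev_positive.get(rid, False))
        prev_positive[rid] = now_positive

    # Step 2 (outer query): running sum of the deltas over timestamps.
    # Rows sharing a timestamp get the same total, like a RANGE frame.
    total = 0
    total_at: Dict[int, int] = {}
    for ts in sorted(delta_per_ts):
        total += delta_per_ts[ts]
        total_at[ts] = total

    return [total_at[ts] for _, _, ts in rows]


unique_ids = running_positive_ids(ROWS)
print(unique_ids)
```

Because each ID contributes only its changes, the work is one ordered pass per ID plus one pass over the timestamps, which is why this formulation avoids collecting all earlier rows for every row.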

Answer 2

Hope this is helpful. This query might not scale because of the cumulative ARRAY_AGG.

WITH data AS (
  SELECT * FROM UNNEST([
    STRUCT
    ('A' as id,1 as value, 1 as time_stamp), 
    ('B', 0, 1),
    ('C', 0, 1),
    ('A', 0, 2),
    ('B', 1, 2),
    ('C', 1, 3),
    ('B', 0, 4)
  ])
),
array_agg AS (
  SELECT *, ARRAY_AGG(d) OVER (ORDER BY time_stamp) arr FROM data d
)
SELECT * EXCEPT(arr), 
       (SELECT COUNTIF(latest_value_by_id > 0) FROM (
          SELECT ARRAY_AGG(i.value ORDER BY i.time_stamp DESC LIMIT 1)[SAFE_OFFSET(0)] latest_value_by_id
            FROM t.arr i GROUP BY i.id
       )) AS id_gt_0
  FROM array_agg t;

+----+-------+------------+---------+
| id | value | time_stamp | id_gt_0 |
+----+-------+------------+---------+
| A  |     1 |          1 |       1 |
| B  |     0 |          1 |       1 |
| C  |     0 |          1 |       1 |
| A  |     0 |          2 |       1 |
| B  |     1 |          2 |       1 |
| C  |     1 |          3 |       2 |
| B  |     0 |          4 |       1 |
+----+-------+------------+---------+


Answer 1
1 2022-09-29 18:39:53

Answer 2
0 2022-09-29 14:01:54
