Nested window functions in Google BigQuery

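For each timestamp, I want to count the unique IDs seen up to and including that timestamp whose last value is greater than 0, using Google BigQuery SQL. I do not want to GROUP BY because I need the whole table as output. The table also has more than 1 billion rows, so the query should be efficient.

Imagine I have a table like this: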

| ID | value | timestamp  |
|:-- | ----- | ----------:|
| A  | 1     | 2021-01-01 |
| B  | 0     | 2021-01-01 |
| C  | 0     | 2021-01-01 |
| A  | 0     | 2021-01-02 |
| B  | 1     | 2021-01-02 |
| C  | 1     | 2021-01-03 |
| B  | 0     | 2021-01-04 |

The result should look like this:

| ID | value | timestamp  | count_val_gt_0 |
|:-- | ----- | ---------- | --------------:|
| A  | 1     | 2021-01-01 | 1              |
| B  | 0     | 2021-01-01 | 1              |
| C  | 0     | 2021-01-01 | 1              |
| A  | 0     | 2021-01-02 | 1              |
| B  | 1     | 2021-01-02 | 1              |
| C  | 1     | 2021-01-03 | 2              |
| B  | 0     | 2021-01-04 | 1              |

Explanation:

timestamp  - set of unique IDs with last value > 0

2021-01-01: {A}
2021-01-02: {B}
2021-01-03: {B,C}
2021-01-04: {C}

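For timestamp 2021-01-01, only A has a value greater than 0, and there are no earlier timestamps. For all rows with timestamp 2021-01-02, I count the unique IDs whose last value within the period 2021-01-01 to 2021-01-02 is greater than 0. A's last value is no longer greater than 0, but B's now is. For timestamp 2021-01-03, B's last value is still greater than 0, and now C's is too, so I count 2. For timestamp 2021-01-04, B is no longer greater than 0, so only C remains: 1.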
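To make the expected output unambiguous, the rule above can be restated as a brute-force reference in Python. This is only a sketch for checking results on small samples, not something to run on a billion rows. The `Row` type and the function name are made up for illustration, and it assumes at most one row per ID per timestamp.

```python
from collections import namedtuple

# Hypothetical row type mirroring the sample table's columns.
Row = namedtuple("Row", ["id", "value", "timestamp"])

# Sample rows from the table in the question.
ROWS = [
    Row("A", 1, "2021-01-01"),
    Row("B", 0, "2021-01-01"),
    Row("C", 0, "2021-01-01"),
    Row("A", 0, "2021-01-02"),
    Row("B", 1, "2021-01-02"),
    Row("C", 1, "2021-01-03"),
    Row("B", 0, "2021-01-04"),
]


def count_ids_with_positive_last_value(rows):
    """For each row, count the IDs whose latest value up to that row's timestamp is > 0."""
    counts = []
    for current in rows:
        latest = {}  # id -> (timestamp, value) of the most recent row at or before `current`
        for r in rows:
            # ISO-formatted date strings compare correctly as plain strings.
            if r.timestamp <= current.timestamp:
                if r.id not in latest or r.timestamp >= latest[r.id][0]:
                    latest[r.id] = (r.timestamp, r.value)
        counts.append(sum(1 for _, value in latest.values() if value > 0))
    return counts


print(count_ids_with_positive_last_value(ROWS))
```

On the sample rows this reproduces the count_val_gt_0 column of the expected result. Its cost is quadratic in the number of rows, which is why the actual query needs a different strategy.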

I tried to follow the approach from "Nested value_of expression in row functions", as follows:

I added a next_timestamp field that holds the next occurrence of the ID, and tried:

SELECT 
  id
, timestamp
, COUNT(DISTINCT CASE WHEN value > 0 AND NOT next_timestamp <= VALUE OF timestamp AT CURRENT_ROW THEN id END) OVER (PARTITION BY timestamp RANGE UNBOUNDED PRECEDING) as count_id_gt_0
FROM my_table

But Google BigQuery does not recognize VALUE OF and fails with: Syntax error: Unexpected keyword OF

Here is a query:

WITH data AS (
  SELECT * FROM UNNEST([
    STRUCT
    ('A' as id,1 as value, 1 as time_stamp), 
    ('B', 0, 1),
    ('C', 0, 1),
    ('A', 0, 2),
    ('B', 1, 2),
    ('C', 1, 3),
    ('B', 0, 4)
  ])
),
final_table AS (
  SELECT
    id
  , value
  , time_stamp
  , LEAD(time_stamp,1) OVER (PARTITION BY id ORDER BY time_stamp) AS next_time
  FROM data
)
  SELECT 
    id
  , value
  , time_stamp
  , next_time
  , COUNT( CASE WHEN value > 0 AND NOT next_time <= ft.time_stamp THEN id END) OVER(
      ORDER BY time_stamp 
      RANGE UNBOUNDED PRECEDING
    ) AS id_gt_0_array
  FROM final_table ft 

The result is still not as expected, because next_time <= ft.time_stamp is ignored:

| id | value | timestamp  | id_gt_0        |
|:-- | ----- | ---------- | --------------:|
| A  | 1     | 2021-01-01 | 1              |
| B  | 0     | 2021-01-01 | 1              |
| C  | 0     | 2021-01-01 | 1              |
| A  | 0     | 2021-01-02 | 1              |
| B  | 1     | 2021-01-02 | 2              |
| C  | 1     | 2021-01-03 | 2              |
| B  | 0     | 2021-01-04 | 2              |

Updated solution:

Following @Mikhail Berlyant's suggestion, I got the correct result, and this query is also very fast:

select * except(new_value), 
  sum(new_value) over win as unique_ids
from (
  select *, 
    if(not lag(value) over by_id is null,
      if(lag(value) over by_id > 0,
        if(value > 0, 0, -1),
        if(value > 0, 1, 0)), 
      if(value > 0,1,0)
    ) new_value
  from final_table
  window by_id as (partition by id order by time_stamp)
)
window win as (order by time_stamp range between unbounded preceding and current row) 

Thanks!

Consider the approach below:

select * except(new_value), 
  sum(new_value) over win as unique_ids
from (
  select *, 
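    if(not lag(value) over by_id is null,
      -- a previous row exists for this id: emit +1 / -1 only when it crosses the > 0 boundary
      if(lag(value) over by_id > 0, if(value > 0, 0, -1), if(value > 0, 1, 0)),
      -- first row for this id
      if(value > 0, 1, 0)
    ) new_value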
  from your_table
  window by_id as (partition by id order by timestamp)
)
window win as (order by timestamp range between unbounded preceding and current row)       

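Please note:

  1. The above is not thoroughly tested and was written only as an example of an alternative solution that addresses the "> 1 billion rows" issue.
  2. Although not fully tested, I ran a very quick check and it appears to work as expected; at least for the dummy example in your question, the output is correct.
  3. For small data, the solution already proposed by Jaytiger is more effective. For big/heavy cases like yours, I think this approach has a good chance of being more efficient.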
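The idea behind this approach can be sketched in Python as a cross-check: each row contributes +1 when its ID's value crosses from "not > 0" to "> 0", -1 for the opposite crossing, and 0 otherwise, and a running sum of those deltas in timestamp order gives the count. The function name and tuple layout below are assumptions made for illustration. The sketch uses the integer time_stamp sample from the question and assumes at most one row per ID per timestamp.

```python
from collections import defaultdict

# (id, value, time_stamp) tuples from the sample data in the question.
ROWS = [
    ("A", 1, 1), ("B", 0, 1), ("C", 0, 1),
    ("A", 0, 2), ("B", 1, 2),
    ("C", 1, 3),
    ("B", 0, 4),
]


def unique_ids_by_delta(rows):
    """Running count of IDs whose latest value is > 0, computed from per-row deltas."""
    # Step 1: per-ID delta, analogous to comparing value with
    # LAG(value) OVER (PARTITION BY id ORDER BY time_stamp).
    order = sorted(range(len(rows)), key=lambda i: (rows[i][0], rows[i][2]))
    was_positive = {}  # id -> whether the previous row of that id had value > 0
    delta = [0] * len(rows)
    for i in order:
        id_, value, _ = rows[i]
        now = value > 0
        delta[i] = int(now) - int(was_positive.get(id_, False))
        was_positive[id_] = now

    # Step 2: running sum over timestamps. Rows sharing a timestamp are peers
    # and get the same total, like a RANGE ... CURRENT ROW window frame.
    per_ts = defaultdict(int)
    for (_, _, ts), d in zip(rows, delta):
        per_ts[ts] += d
    running, total = {}, 0
    for ts in sorted(per_ts):
        total += per_ts[ts]
        running[ts] = total
    return [running[ts] for _, _, ts in rows]


print(unique_ids_by_delta(ROWS))
```

Because every ID's contribution is reduced to a small delta per row, no per-row set of IDs has to be materialized, which is the likely reason this strategy scales better than accumulating arrays.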

Hope this is helpful. Because of the cumulative ARRAY_AGG, this query may not scale.

WITH data AS (
  SELECT * FROM UNNEST([
    STRUCT
    ('A' as id,1 as value, 1 as time_stamp), 
    ('B', 0, 1),
    ('C', 0, 1),
    ('A', 0, 2),
    ('B', 1, 2),
    ('C', 1, 3),
    ('B', 0, 4)
  ])
),
array_agg AS (
  SELECT *, ARRAY_AGG(d) OVER (ORDER BY time_stamp) arr FROM data d
)
SELECT * EXCEPT(arr), 
       (SELECT COUNTIF(latest_value_by_id > 0) FROM (
          SELECT ARRAY_AGG(i.value ORDER BY i.time_stamp DESC LIMIT 1)[SAFE_OFFSET(0)] latest_value_by_id
            FROM t.arr i GROUP BY i.id
       )) AS id_gt_0
  FROM array_agg t;

+----+-------+------------+---------+
| id | value | time_stamp | id_gt_0 |
+----+-------+------------+---------+
| A  |     1 |          1 |       1 |
| B  |     0 |          1 |       1 |
| C  |     0 |          1 |       1 |
| A  |     0 |          2 |       1 |
| B  |     1 |          2 |       1 |
| C  |     1 |          3 |       2 |
| B  |     0 |          4 |       1 |
+----+-------+------------+---------+
