Nested window functions in Google BigQuery
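In Google BigQuery SQL, I want to count, for each timestamp, the unique IDs across that timestamp and all earlier ones whose last value is greater than 0. I don't want to use GROUP BY because I need the whole table as output. The table also has more than 1 billion rows, so the query should be efficient.
Imagine I have a table like this: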
| ID | value | timestamp |
|:-- | ----- | ----------:|
| A | 1 | 2021-01-01 |
| B | 0 | 2021-01-01 |
| C | 0 | 2021-01-01 |
| A | 0 | 2021-01-02 |
| B | 1 | 2021-01-02 |
| C | 1 | 2021-01-03 |
| B | 0 | 2021-01-04 |
The result should look like this:
| ID | value | timestamp | count_val_gt_0 |
|:-- | ----- | ---------- | --------------:|
| A | 1 | 2021-01-01 | 1 |
| B | 0 | 2021-01-01 | 1 |
| C | 0 | 2021-01-01 | 1 |
| A | 0 | 2021-01-02 | 1 |
| B | 1 | 2021-01-02 | 1 |
| C | 1 | 2021-01-03 | 2 |
| B | 0 | 2021-01-04 | 1 |
Explanation:
timestamp: set of unique IDs whose last value is > 0
2021-01-01: {A}
2021-01-02: {B}
2021-01-03: {B, C}
2021-01-04: {C}
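For timestamp 2021-01-01, only A has a value greater than 0, and there are no earlier timestamps. For all rows with timestamp 2021-01-02, I count the unique IDs whose last value across timestamps 2021-01-01 and 2021-01-02 is greater than 0: A's last value is no longer greater than 0, but B's now is. For timestamp 2021-01-03, B's last value is still greater than 0 and now C's is too, so I count 2. For timestamp 2021-01-04, B is no longer greater than 0, so only C remains: 1.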
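To pin down the intended semantics, here is a minimal brute-force reference in Python. It is only a sketch for checking small samples, not something to run on a billion rows. The function name `count_ids_with_positive_latest` and the tuple layout are illustrative assumptions, and it assumes at most one row per (ID, timestamp) pair.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, str]  # (id, value, timestamp as ISO date string)


def count_ids_with_positive_latest(rows: List[Row]) -> List[int]:
    """For each row, count the IDs whose latest value at or before
    that row's timestamp is greater than 0 (quadratic, for checking only)."""
    result = []
    for _, _, ts in rows:
        # id -> (timestamp, value) of the latest row seen up to ts
        latest: Dict[str, Tuple[str, int]] = {}
        for rid, value, rts in rows:
            if rts <= ts and (rid not in latest or rts >= latest[rid][0]):
                latest[rid] = (rts, value)
        result.append(sum(1 for _, v in latest.values() if v > 0))
    return result


# Sample table from the question.
ROWS: List[Row] = [
    ("A", 1, "2021-01-01"),
    ("B", 0, "2021-01-01"),
    ("C", 0, "2021-01-01"),
    ("A", 0, "2021-01-02"),
    ("B", 1, "2021-01-02"),
    ("C", 1, "2021-01-03"),
    ("B", 0, "2021-01-04"),
]

print(count_ids_with_positive_latest(ROWS))  # [1, 1, 1, 1, 1, 2, 1]
```

The printed list matches the count_val_gt_0 column of the expected result, row by row.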
I tried to follow this approach (from "Nested value_of expressions in row functions"): I added a next_timestamp field that holds the next occurrence of the ID and tried:
SELECT
id
, timestamp
, COUNT(DISTINCT CASE WHEN value > 0 AND NOT next_timestamp <= VALUE OF timestamp AT CURRENT_ROW THEN id END) OVER (PARTITION BY timestamp RANGE UNBOUNDED PRECEDING) as count_id_gt_0
FROM my_table
But VALUE OF is not recognized in Google BigQuery: Syntax error: Unexpected keyword OF
Here is a query:
WITH data AS (
SELECT * FROM UNNEST([
STRUCT
('A' as id,1 as value, 1 as time_stamp),
('B', 0, 1),
('C', 0, 1),
('A', 0, 2),
('B', 1, 2),
('C', 1, 3),
('B', 0, 4)
])
),
final_table AS (
SELECT
id
, value
, time_stamp
, LEAD(time_stamp,1) OVER (PARTITION BY id ORDER BY time_stamp) AS next_time
FROM data
)
SELECT
id
, value
, time_stamp
, next_time
, COUNT( CASE WHEN value > 0 AND NOT next_time <= ft.time_stamp THEN id END) OVER(
ORDER BY time_stamp
RANGE UNBOUNDED PRECEDING
) AS id_gt_0_array
FROM final_table ft
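The result is still not as expected, because the condition next_time <= ft.time_stamp is effectively ignored (inside the window aggregate, ft.time_stamp refers to each aggregated row's own timestamp, not to the current row's):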
| id | value | timestamp | id_gt_0 |
|:-- | ----- | ---------- | --------------:|
| A | 1 | 2021-01-01 | 1 |
| B | 0 | 2021-01-01 | 1 |
| C | 0 | 2021-01-01 | 1 |
| A | 0 | 2021-01-02 | 1 |
| B | 1 | 2021-01-02 | 2 |
| C | 1 | 2021-01-03 | 2 |
| B | 0 | 2021-01-04 | 2 |
Update with the solution:
Following @Mikhail Berlyant's suggestion I got the correct result, and this query is also very fast:
select * except(new_value),
sum(new_value) over win as unique_ids
from (
select *,
if(not lag(value) over by_id is null,
if(lag(value) over by_id > 0,
if(value > 0, 0, -1),
if(value > 0, 1, 0)),
if(value > 0,1,0)
) new_value
from final_table
window by_id as (partition by id order by time_stamp)
)
window win as (order by time_stamp range between unbounded preceding and current row)
Thanks!
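The idea behind the query above is to turn each row into a delta (+1 when an ID's value becomes positive, -1 when it stops being positive, 0 otherwise) and then take a running sum ordered by time. A minimal Python sketch of that same idea follows. The name `running_positive_ids` is an illustrative assumption, and it assumes at most one row per (id, time_stamp) pair. Rows sharing a time_stamp all receive the total after every delta of that time_stamp is applied, mimicking the RANGE frame.

```python
from itertools import groupby
from typing import Dict, List, Tuple

Row = Tuple[str, int, int]  # (id, value, time_stamp)


def running_positive_ids(rows: List[Row]) -> List[int]:
    """Delta per row plus a running sum, mirroring LAG(...) and SUM(...) OVER."""
    order = sorted(range(len(rows)), key=lambda i: rows[i][2])
    was_positive: Dict[str, bool] = {}  # per-id state, plays the role of LAG(value)
    counts = [0] * len(rows)
    total = 0
    for _, group in groupby(order, key=lambda i: rows[i][2]):
        peers = list(group)
        for i in peers:
            rid, value, _ = rows[i]
            now_positive = value > 0
            # +1 entering the set, -1 leaving it, 0 when nothing changes
            total += int(now_positive) - int(was_positive.get(rid, False))
            was_positive[rid] = now_positive
        for i in peers:  # peers share the same cumulative total
            counts[i] = total
    return counts


SAMPLE: List[Row] = [
    ("A", 1, 1), ("B", 0, 1), ("C", 0, 1),
    ("A", 0, 2), ("B", 1, 2),
    ("C", 1, 3),
    ("B", 0, 4),
]

print(running_positive_ids(SAMPLE))  # [1, 1, 1, 1, 1, 2, 1]
```

Each row is touched once after sorting, which is why this formulation scales far better than recomputing the set of IDs for every timestamp.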
Consider the following approach:
select * except(new_value),
sum(new_value) over win as unique_ids
from (
select *,
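if(not lag(value) over by_id is null,
if(lag(value) over by_id > 0,
if(value > 0, 0, -1), -- id was counted: it leaves the set only if value drops to <= 0
if(value > 0, 1, 0)), -- id was not counted: it enters the set only if value rises above 0
if(value > 0, 1, 0) -- first row of this id
) new_value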
from your_table
window by_id as (partition by id order by timestamp)
)
window win as (order by timestamp range between unbounded preceding and current row)
Output: with the sample data from the question, this returns the expected counts (1, 1, 1, 1, 1, 2, 1).
Hope this helps. Because of the cumulative ARRAY_AGG, this query may not scale.
WITH data AS (
SELECT * FROM UNNEST([
STRUCT
('A' as id,1 as value, 1 as time_stamp),
('B', 0, 1),
('C', 0, 1),
('A', 0, 2),
('B', 1, 2),
('C', 1, 3),
('B', 0, 4)
])
),
array_agg AS (
SELECT *, ARRAY_AGG(d) OVER (ORDER BY time_stamp) arr FROM data d
)
SELECT * EXCEPT(arr),
(SELECT COUNTIF(latest_value_by_id > 0) FROM (
SELECT ARRAY_AGG(i.value ORDER BY i.time_stamp DESC LIMIT 1)[SAFE_OFFSET(0)] latest_value_by_id
FROM t.arr i GROUP BY i.id
)) AS id_gt_0
FROM array_agg t;
+----+-------+------------+---------+
| id | value | time_stamp | id_gt_0 |
+----+-------+------------+---------+
| A | 1 | 1 | 1 |
| B | 0 | 1 | 1 |
| C | 0 | 1 | 1 |
| A | 0 | 2 | 1 |
| B | 1 | 2 | 1 |
| C | 1 | 3 | 2 |
| B | 0 | 4 | 1 |
+----+-------+------------+---------+