Google BigQuery - why does window function order by cause memory error although used together with partition by
I am getting a memory error in Google BigQuery that I don't understand:
My base table (> 1 billion rows) consists of a userID, a date, and a balance_increment for that day. From the daily balance_increment I want to return the running total balance every time there is a new increment. For the next step I also need the date on which the next balance increment occurs. So I do this:
select
userID
, date
, sum(balance_increment) over (partition by userID order by date) as balance
, lead(date, 1, current_date()) over (partition by userID order by date) as next_date
from my_base_table
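For intuition, the two window functions above can be mimicked in pandas on a small made-up dataset (column names taken from the query; the data is purely illustrative):

```python
# Hypothetical in-memory equivalent of the two BigQuery window functions,
# sketched with pandas on made-up data.
import datetime
import pandas as pd

df = pd.DataFrame({
    "userID":            ["a", "a", "a", "b", "b"],
    "date":              ["2024-01-01", "2024-01-03", "2024-01-07",
                          "2024-01-02", "2024-01-05"],
    "balance_increment": [10, -3, 5, 7, 2],
})
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(["userID", "date"])  # "order by date" within each partition

# sum(balance_increment) over (partition by userID order by date)
df["balance"] = df.groupby("userID")["balance_increment"].cumsum()

# lead(date, 1, current_date()) over (partition by userID order by date)
today = pd.Timestamp(datetime.date.today())
df["next_date"] = df.groupby("userID")["date"].shift(-1).fillna(today)

print(df[["userID", "date", "balance", "next_date"]])
```

Here the cumulative sum and the shifted date are computed independently per userID group, which is exactly what the partition by clause promises.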
Although I use partition by in the over clause, I get a memory error on that query because of the sort operation (if I understood the order by correctly?):
BadRequest: 400 Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 135% of limit.
Top memory consumer(s):
sort operations used for analytic OVER() clauses: 98%
other/unattributed: 2%
But when I check how often a single userID occurs, it is at most just under 4,000 times. I know I have a lot of userIDs (apparently > 31 million, as shown in the image below), but I thought partition by would split the work across slots as needed?
Here I check how often a single userID occurs. This query, by the way, works fine:
SELECT
userID
, count(*) as userID_count
FROM my_base_table
GROUP BY userID
ORDER BY userID_count DESC
(Sorry, in the image I called it entity instead of userID.)
So my question is: why is order by date such a big deal when, within each partition by, fewer than 4,000 rows have to be sorted?

I solved the memory issue by pre-ordering the base table by userID and date, as suggested by @Samuel, who pointed out that pre-ordering should reduce the key exchange between nodes - and it worked!
With ordered_base_table as (
Select * from my_base_table order by userID, date
)
select
userID
, date
, sum(balance_increment) over (partition by userID order by date) as balance
, lead(date, 1, current_date()) over (partition by userID order by date) as next_date
from ordered_base_table
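The intuition behind the question above (each partition is tiny, so sorting inside a partition should be cheap even with millions of partitions) can be sketched in plain Python with made-up data:

```python
# Sketch of the per-partition view of the work: bucket rows by userID,
# then sort and running-sum each small bucket independently.
# Data is made up for illustration.
from collections import defaultdict
from itertools import accumulate

rows = [  # (userID, date, balance_increment)
    ("b", "2024-01-05", 2),
    ("a", "2024-01-03", -3),
    ("a", "2024-01-01", 10),
    ("b", "2024-01-02", 7),
    ("a", "2024-01-07", 5),
]

# "partition by userID": bucket rows per user, no global sort required
partitions = defaultdict(list)
for user, date, inc in rows:
    partitions[user].append((date, inc))

# "order by date" *within* each small partition, then the running sum
balances = {}
for user, items in partitions.items():
    items.sort()  # sorts at most a few thousand rows per user
    balances[user] = list(accumulate(inc for _, inc in items))

print(balances)
```

Each bucket is sorted on its own, which mirrors why a partitioned analytic function should not need one giant global sort; the pre-ordering fix above helps BigQuery get the data laid out that way before the OVER() clauses run.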
Thanks!