
Google BigQuery - why does a window function's ORDER BY cause a memory error although it is used together with PARTITION BY?

I get a memory error in Google BigQuery that I don't understand:

My base table (> 1 billion rows) consists of a user ID, a balance increment per day, and the day. From the balance_increment per day I want to return the total balance each time there is a new increment. For the next step I also need the date of the next balance increment. So I do:

select 
    userID
    ,   date
    ,   sum(balance_increment) over (partition by userID order by date) as balance
    ,   lead(date, 1, current_date()) over (partition by userID order by date) as next_date
from my_base_table
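
To make the intent of this query concrete, here is a minimal sketch that runs the same window expressions on a few made-up rows. The inline sample data (users 'a' and 'b', their dates and increments) is purely hypothetical and only stands in for my_base_table.

```sql
-- Hypothetical sample rows standing in for my_base_table (not real data).
WITH my_base_table AS (
  SELECT * FROM UNNEST([
    STRUCT('a' AS userID, DATE '2023-01-01' AS date, 10 AS balance_increment),
    STRUCT('a', DATE '2023-01-05', -3),
    STRUCT('b', DATE '2023-01-02', 7)
  ])
)
SELECT
  userID,
  date,
  -- Running total per user: with ORDER BY and no explicit frame, the sum
  -- covers all rows of the partition up to and including the current date.
  SUM(balance_increment) OVER (PARTITION BY userID ORDER BY date) AS balance,
  -- Date of the user's next increment; falls back to today for the last row.
  LEAD(date, 1, CURRENT_DATE()) OVER (PARTITION BY userID ORDER BY date) AS next_date
FROM my_base_table
ORDER BY userID, date
```

For user 'a' this gives a balance of 10 on 2023-01-01 with next_date 2023-01-05, then a balance of 7 on 2023-01-05 with next_date set to the current date. User 'b' has a single row with balance 7 and next_date also set to the current date.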

Although I used partition by in the over clause, this query fails with a memory error caused by the sort operation (the order by, if I understood correctly?):

BadRequest: 400 Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 135% of limit.
Top memory consumer(s):
  sort operations used for analytic OVER() clauses: 98%
  other/unattributed: 2%

But when I check how often a unique user ID appears, the maximum is not even 4000 times. I know that I have a lot of userIDs (apparently > 31 million, as the image below suggests), but I thought that with partition by the query would be split across different slots if necessary?

Here I check how often a single userID occurs. This query, by the way, works just fine:

SELECT
  userID
  , count(*) as userID_count
FROM my_base_table
GROUP BY userID
ORDER BY userID_count DESC

(sorry, in the image I called it entity instead of userID)

[Image: result of the count query above]
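
The two figures quoted here (the largest number of rows for a single userID and the number of distinct userIDs) can also be read off in one query without sorting all groups at the end. This is only a sketch; the output column names distinct_users and max_rows_per_user are made up for illustration.

```sql
-- Summarize the per-user row counts instead of listing and sorting them.
SELECT
  COUNT(*) AS distinct_users,              -- one row per userID in the subquery
  MAX(userID_count) AS max_rows_per_user   -- size of the largest partition
FROM (
  SELECT
    userID,
    COUNT(*) AS userID_count
  FROM my_base_table
  GROUP BY userID
)
```

The max_rows_per_user value is the upper bound on how many rows any single partition of the window function has to sort.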

So my questions are:

  1. Did I understand correctly that the memory error comes from the order by date?
  2. Why is that a big issue when, thanks to the partition by, fewer than 4000 rows per userID have to be ordered?
  3. Why does my second query run through, although at the end I have to order > 31 million rows?
  4. How can I solve this issue?

I solved the memory issue by pre-ordering the base table by userID and date, as suggested by @Samuel, who pointed out that pre-ordering should reduce the key exchange across the nodes. It worked!

With ordered_base_table as (
Select * from my_base_table order by userID, date
)

select 
    userID
    ,   date
    ,   sum(balance_increment) over (partition by userID order by date) as balance
    ,   lead(date, 1, current_date()) over (partition by userID order by date) as next_date
from ordered_base_table
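
Since both window functions use the same partitioning and ordering, the query can also be written with a named window. This is only a readability sketch; the window name w is arbitrary, and no claim is made here that it changes memory usage.

```sql
WITH ordered_base_table AS (
  SELECT * FROM my_base_table ORDER BY userID, date
)
SELECT
  userID,
  date,
  SUM(balance_increment) OVER w AS balance,
  LEAD(date, 1, CURRENT_DATE()) OVER w AS next_date
FROM ordered_base_table
-- Define the shared window specification once and reference it by name.
WINDOW w AS (PARTITION BY userID ORDER BY date)
```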

Thanks!

