Efficient forward fill in BigQuery

I am trying to forward fill a table in BigQuery, but I run out of resources when executing the query. The table size is 2 GB. The table looks like this:

with t as (
    select timestamp '2021-05-01 00:00:01' as time, 10 as number union all
    select timestamp '2021-05-01 05:00:01' as time, NULL as number union all
    select timestamp '2021-05-01 23:00:01' as time, 20 as number union all
    select timestamp '2021-05-02 00:00:01' as time, NULL as number union all
    select timestamp '2021-05-02 01:00:01' as time, NULL as number union all 
    select timestamp '2021-05-02 05:00:01' as time, 12 as number
)

time                 number
2021-05-01 00:00:01  10
2021-05-01 05:00:01  NULL
2021-05-01 23:00:01  20
2021-05-02 00:00:01  NULL
2021-05-02 01:00:01  NULL
2021-05-02 05:00:01  12

The desired output is:

time                 number
2021-05-01 00:00:01  10
2021-05-01 05:00:01  10
2021-05-01 23:00:01  20
2021-05-02 00:00:01  20
2021-05-02 01:00:01  20
2021-05-02 05:00:01  12

My current solution is:

SELECT time,
LAST_VALUE(number IGNORE NULLS) OVER(ORDER BY time) AS number
FROM t

It fails with:

Resources exceeded during query execution: The query could not be executed in the allotted memory.

The problem is the OVER clause with ORDER BY. I tried running the query partitioned by day, and it executed successfully.

SELECT time,
LAST_VALUE(number IGNORE NULLS) OVER(PARTITION BY DATETIME_TRUNC(time, day) ORDER BY time) AS number
FROM t

time                 number
2021-05-01 00:00:01  10
2021-05-01 05:00:01  10
2021-05-01 23:00:01  20
2021-05-02 00:00:01  NULL
2021-05-02 01:00:01  NULL
2021-05-02 05:00:01  12

The problem is that the result still contains NULL values, although about 500 times fewer than the original table. I am not sure whether the problem can be solved by building on this. Is there an efficient way to do it?

Try the following:

SELECT time, 
NTH_VALUE(number, 1 IGNORE NULLS) OVER(ORDER BY time DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) AS number
FROM t

Or:

SELECT time, 
  FIRST_VALUE(number IGNORE NULLS) OVER(ORDER BY time DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) AS number
FROM t    

I don't have a good sample of real data to test with, so this is just a guess.

I changed the partition from day to month, and the NULLs were filled.

The following resolved it for me:

with t as (
        select timestamp '2021-05-01 00:00:01' as time, 10 as number union all
        select timestamp '2021-05-01 05:00:01' as time, NULL as number union all
        select timestamp '2021-05-01 23:00:01' as time, 20 as number union all
        select timestamp '2021-05-02 00:00:01' as time, NULL as number union all
        select timestamp '2021-05-02 01:00:01' as time, NULL as number union all 
        select timestamp '2021-05-02 05:00:01' as time, 12 as number
    )


SELECT time,
LAST_VALUE(number IGNORE NULLS) OVER(PARTITION BY DATETIME_TRUNC(time, month) ORDER BY time) AS number
FROM t
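
One caveat with any partitioned fill: the window cannot look across partition boundaries, so a day or month whose first rows are NULL stays unfilled at the start. A possible way around this, given here only as a sketch, is a two-stage fill: fill inside each day, collapse each day to its last non-null value, forward fill that one-row-per-day table, and use the carried value for rows that are still NULL. The CTE names (filled_in_day, day_last, day_carry) and the day_start column are illustrative and not from the original posts. Whether this stays within memory limits on the 2 GB table has not been tested.

```sql
WITH t AS (
  SELECT TIMESTAMP '2021-05-01 00:00:01' AS time, 10 AS number UNION ALL
  SELECT TIMESTAMP '2021-05-01 05:00:01', NULL UNION ALL
  SELECT TIMESTAMP '2021-05-01 23:00:01', 20 UNION ALL
  SELECT TIMESTAMP '2021-05-02 00:00:01', NULL UNION ALL
  SELECT TIMESTAMP '2021-05-02 01:00:01', NULL UNION ALL
  SELECT TIMESTAMP '2021-05-02 05:00:01', 12
),
-- Stage 1: forward fill inside each day (many small partitions).
filled_in_day AS (
  SELECT
    time,
    TIMESTAMP_TRUNC(time, DAY) AS day_start,
    LAST_VALUE(number IGNORE NULLS) OVER (
      PARTITION BY TIMESTAMP_TRUNC(time, DAY) ORDER BY time
    ) AS number
  FROM t
),
-- Stage 2: one row per day with that day's last non-null value
-- (NULL if the whole day is NULL).
day_last AS (
  SELECT
    day_start,
    ARRAY_AGG(number IGNORE NULLS ORDER BY time DESC LIMIT 1)[SAFE_OFFSET(0)]
      AS last_number
  FROM filled_in_day
  GROUP BY day_start
),
-- Stage 3: for each day, the last non-null value from any earlier day.
-- This window is unpartitioned, but it only sees one row per day.
day_carry AS (
  SELECT
    day_start,
    LAST_VALUE(last_number IGNORE NULLS) OVER (
      ORDER BY day_start
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS carried_number
  FROM day_last
)
-- Rows still NULL after stage 1 take the value carried from earlier days.
SELECT
  f.time,
  COALESCE(f.number, c.carried_number) AS number
FROM filled_in_day AS f
JOIN day_carry AS c USING (day_start)
ORDER BY f.time
```

On the sample data above this yields 10, 10, 20, 20, 20, 12, which matches the desired output. The only unpartitioned ORDER BY runs over the per-day summary rather than the full table, which is the reason for splitting the fill into stages.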
