Get the next (or previous) non-null value in multiple partitioned
Sample data below.
I want to clean up data based on the next non-null value of the same id, ordered by row (actually a timestamp).
The only thing I can think of is to do
table_x as (select id, col_x from table where col_a is not null)
for each column, and then join, taking the minimum table_x.row where the ids match and table_x.row > table.row. But I have a handful of columns, and that feels cumbersome and inefficient.
Appreciate any help!
row | id | col_a | col_a_desired | col_b | col_b_desired
---|---|---|---|---|---
0 | 1 | - | NYC | red | red
1 | 1 | NYC | NYC | red | red
2 | 1 | SF | SF | - | blue
3 | 1 | - | SF | - | blue
4 | 1 | SF | SF | blue | blue
5 | 2 | PAR | PAR | red | red
6 | 2 | LON | LON | - | blue
7 | 2 | LON | LON | - | blue
8 | 2 | - | LON | blue | blue
9 | 2 | LON | LON | - | blue
10 | 2 | - | LON | - | blue
"I want to clean up data based on the next non-null value."
So if you reverse the order, that's the last non-null value.
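To make the reversal idea concrete, here is a minimal Python sketch (not SQL, and not anything from the question itself). The function name `fill_with_next_non_null` is made up for illustration, and `None` stands in for NULL. It scans one partition backwards and carries the last non-null value it has seen.

```python
from typing import List, Optional, Sequence


def fill_with_next_non_null(values: Sequence[Optional[str]]) -> List[Optional[str]]:
    """Replace each None with the next non-None value later in the sequence."""
    filled = []
    last_seen = None  # most recent non-null value met while scanning backwards
    for value in reversed(values):
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    filled.reverse()  # restore the original row order
    return filled


# col_b for id = 1 in the sample data (rows 0 to 4)
print(fill_with_next_non_null(["red", "red", None, None, "blue"]))
```

Trailing nulls stay null here, because nothing follows them. The desired output in the question fills those from the previous value instead, so that case needs a separate fallback.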
If you have multiple columns and the logic is too cumbersome to write in SQL, you can write it in plpgsql instead, or even use the scripting language of your choice (but that will be slower).
The idea is to open a cursor for update, with an ORDER BY in the reverse of the order mentioned in the question. The plpgsql code then stores the last non-null values in variables and, where needed, issues an UPDATE WHERE CURRENT OF cursor to replace the nulls in the table with the desired values.
This may take a while, and the numerous updates will take a lot of locks. It looks like your data can be processed in independent chunks using the "id" column as the chunk identifier, so it would be a good idea to use that.
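As a rough sketch of the "scripting language" variant, the loop could look like the following in Python. Several things here are assumptions for illustration only: the in-memory SQLite database stands in for PostgreSQL, the table name `t` and the helper `fill_chunk` are hypothetical, and rows are updated by their `row_id` key because the sqlite3 module has no `WHERE CURRENT OF`. Each id is processed as its own chunk with its own commit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute(
    "CREATE TABLE t (row_id INTEGER PRIMARY KEY, id INTEGER, col_a TEXT, col_b TEXT)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    (0, 1, None, "red"), (1, 1, "NYC", "red"), (2, 1, "SF", None),
    (3, 1, None, None), (4, 1, "SF", "blue"),
    (5, 2, "PAR", "red"), (6, 2, "LON", None), (7, 2, "LON", None),
    (8, 2, None, "blue"), (9, 2, "LON", None), (10, 2, None, None),
])


def fill_chunk(conn, chunk_id):
    """Fill nulls for one id by scanning its rows in reverse order."""
    last_a = last_b = None  # last non-null values seen so far in this chunk
    rows = conn.execute(
        "SELECT row_id, col_a, col_b FROM t WHERE id = ? ORDER BY row_id DESC",
        (chunk_id,),
    ).fetchall()
    for row_id, col_a, col_b in rows:
        last_a = col_a if col_a is not None else last_a
        last_b = col_b if col_b is not None else last_b
        fills_a = col_a is None and last_a is not None
        fills_b = col_b is None and last_b is not None
        if fills_a or fills_b:  # only touch rows that actually change
            conn.execute(
                "UPDATE t SET col_a = ?, col_b = ? WHERE row_id = ?",
                (last_a, last_b, row_id),
            )
    conn.commit()  # one transaction per chunk keeps lock lifetimes short


for (chunk_id,) in conn.execute("SELECT DISTINCT id FROM t ORDER BY id").fetchall():
    fill_chunk(conn, chunk_id)

result = conn.execute("SELECT row_id, col_a, col_b FROM t ORDER BY row_id").fetchall()
for row in result:
    print(row)
```

This single reverse pass leaves the trailing nulls of each id untouched (rows 9 and 10 in the sample), whereas the desired columns fill those from the previous value. A second pass in forward order over the same chunk would be one way to cover them.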
Can you try this query?
WITH samp AS (
  -- Sample rows from the question; NULL marks a missing value
  SELECT 0 AS row_id, 1 AS id, NULL AS col_a, 'red' AS col_b UNION ALL
  SELECT 1, 1, 'NYC', 'red' UNION ALL
  SELECT 2, 1, 'SF', NULL UNION ALL
  SELECT 3, 1, NULL, NULL UNION ALL
  SELECT 4, 1, 'SF', 'blue' UNION ALL
  SELECT 5, 2, 'PAR', 'red' UNION ALL
  SELECT 6, 2, 'LON', NULL UNION ALL
  SELECT 7, 2, 'LON', NULL UNION ALL
  SELECT 8, 2, NULL, 'blue' UNION ALL
  SELECT 9, 2, 'LON', NULL UNION ALL
  SELECT 10, 2, NULL, NULL
)
SELECT
  row_id,
  id,
  -- Next non-null col_a within the same id; if there is none, the previous one
  IFNULL(
    FIRST_VALUE(col_a IGNORE NULLS) OVER (
      PARTITION BY id ORDER BY row_id
      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
    FIRST_VALUE(col_a IGNORE NULLS) OVER (
      PARTITION BY id ORDER BY row_id DESC
      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_a,
  -- Same logic for col_b
  IFNULL(
    FIRST_VALUE(col_b IGNORE NULLS) OVER (
      PARTITION BY id ORDER BY row_id
      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
    FIRST_VALUE(col_b IGNORE NULLS) OVER (
      PARTITION BY id ORDER BY row_id DESC
      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_b
FROM samp
ORDER BY id, row_id
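If it helps to sanity-check the logic outside BigQuery, here is a small Python sketch of the same rule: per id, take the first non-null value at or after the current row, and fall back to the closest non-null value before it. This is an illustration, not a translation of the query; the names `SAMPLE`, `next_else_previous` and `fill` are made up, and `None` stands in for the "-" cells of the question's table.

```python
from itertools import groupby

# (row, id, col_a, col_a_desired, col_b, col_b_desired) from the question's table
SAMPLE = [
    (0, 1, None, "NYC", "red", "red"),
    (1, 1, "NYC", "NYC", "red", "red"),
    (2, 1, "SF", "SF", None, "blue"),
    (3, 1, None, "SF", None, "blue"),
    (4, 1, "SF", "SF", "blue", "blue"),
    (5, 2, "PAR", "PAR", "red", "red"),
    (6, 2, "LON", "LON", None, "blue"),
    (7, 2, "LON", "LON", None, "blue"),
    (8, 2, None, "LON", "blue", "blue"),
    (9, 2, "LON", "LON", None, "blue"),
    (10, 2, None, "LON", None, "blue"),
]


def next_else_previous(values):
    """For each position: first non-null at or after it, else closest non-null before it."""
    out = []
    for i in range(len(values)):
        following = [v for v in values[i:] if v is not None]
        preceding = [v for v in reversed(values[:i]) if v is not None]
        if following:
            out.append(following[0])
        else:
            out.append(preceding[0] if preceding else None)
    return out


def fill(sample, col_index):
    """Apply next_else_previous to one column, partitioned by id."""
    filled = []
    # groupby relies on the rows already being sorted by id, then row
    for _, rows in groupby(sample, key=lambda r: r[1]):
        filled.extend(next_else_previous([r[col_index] for r in rows]))
    return filled


col_a_filled = fill(SAMPLE, 2)
col_b_filled = fill(SAMPLE, 4)
print(col_a_filled == [r[3] for r in SAMPLE])  # compare with col_a_desired
print(col_b_filled == [r[5] for r in SAMPLE])  # compare with col_b_desired
```

The fallback branch is what fills rows 9 and 10 of id 2, where no later non-null value exists.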
References:
https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value
https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls