Get the next (or previous) non-null value across multiple partitions

Question

Sample data below.

I want to clean up data by filling each null with the next non-null value for the same id, ordered by row (in the real data this is a timestamp).

I can't use lag, because in some cases there are consecutive nulls.
I can't do coalesce(a.col_a, (select min(b.col_a) from table b where a.id=b.id)) because it will return an "outdated" value (e.g. NYC instead of SF in col_a row 3). (I can do this once I've accounted for everything else, for the cases where I have no next non-null value, like col_b rows 9/10, to just fill in the last one.)

The only thing I can think of is to do

table_x as (select id, row, col_x from table where col_x is not null)

for each column, and then join, taking the match with the minimum row where id = id and table_x.row > table.row. But I have a handful of columns, and that feels cumbersome and inefficient.
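For reference, here is a rough sketch of what that per-column approach could look like for col_a. It runs through Python's sqlite3 module only so that it is self-contained. The table name `t` and the column name `row_id` are placeholders (the question's `table` and `row` are reserved words in many dialects), and a correlated subquery with ORDER BY ... LIMIT 1 stands in for the "join on the minimum row" step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (row_id INTEGER PRIMARY KEY, id INTEGER, col_a TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    (0, 1, None), (1, 1, "NYC"), (2, 1, "SF"), (3, 1, None), (4, 1, "SF"),
    (5, 2, "PAR"), (6, 2, "LON"), (7, 2, "LON"), (8, 2, None),
    (9, 2, "LON"), (10, 2, None),
])

# table_a holds only the non-null col_a rows; for each row of t we pick the
# closest later row of the same id from it.
QUERY = """
WITH table_a AS (
    SELECT id, row_id, col_a FROM t WHERE col_a IS NOT NULL
)
SELECT t.row_id,
       COALESCE(t.col_a,
                (SELECT x.col_a
                 FROM table_a AS x
                 WHERE x.id = t.id AND x.row_id > t.row_id
                 ORDER BY x.row_id
                 LIMIT 1)) AS col_a_filled
FROM t
ORDER BY t.row_id
"""

filled = conn.execute(QUERY).fetchall()
for row in filled:
    print(row)  # row 10 stays NULL because no later non-null value exists
```

Every additional column needs its own filtered CTE and its own subquery, which is exactly the repetition the question complains about.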

Appreciate any help!

row	id	col_a	col_a_desired	col_b	col_b_desired
0	1	-	NYC	red	red
1	1	NYC	NYC	red	red
2	1	SF	SF	-	blue
3	1	-	SF	-	blue
4	1	SF	SF	blue	blue
5	2	PAR	PAR	red	red
6	2	LON	LON	-	blue
7	2	LON	LON	-	blue
8	2	-	LON	blue	blue
9	2	LON	LON	-	blue
10	2	-	LON	-	blue

Answer 1

"I want to clean up data based on the next non-null value."

So if you reverse the order, the next non-null value becomes the last non-null value seen so far.
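To make the reversal concrete, here is a minimal sketch on a plain Python list (None stands in for SQL NULL). The function name `next_non_null` is made up for this illustration.

```python
def next_non_null(values):
    """For each position, return the first non-null value at or after it."""
    filled = []
    last_seen = None
    # Walking backwards turns "next non-null" into "last non-null seen so far".
    for value in reversed(values):
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    filled.reverse()  # restore the original order
    return filled


col_a_for_id_1 = [None, "NYC", "SF", None, "SF"]
print(next_non_null(col_a_for_id_1))
```

Trailing nulls stay None here, since nothing follows them; the question handles that case separately by falling back to the last known value.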

If you have multiple columns and the logic is too cumbersome to write in SQL, you can write it in PL/pgSQL instead, or even use the scripting language of your choice (but that will be slower).

The idea is to open a cursor FOR UPDATE, with an ORDER BY that is the reverse of the order mentioned in the question. The PL/pgSQL code then stores the last non-null values in variables and, where needed, issues an UPDATE ... WHERE CURRENT OF cursor to replace the nulls in the table with the desired values.
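The loop described above could be sketched as follows. This is not PL/pgSQL: it uses Python with sqlite3 so it can run on its own, and because SQLite has no UPDATE ... WHERE CURRENT OF, the update targets the primary key instead. The table name `t` and the column name `row_id` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (row_id INTEGER PRIMARY KEY, id INTEGER, col_a TEXT, col_b TEXT)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    (0, 1, None, "red"), (1, 1, "NYC", "red"), (2, 1, "SF", None),
    (3, 1, None, None), (4, 1, "SF", "blue"),
    (5, 2, "PAR", "red"), (6, 2, "LON", None), (7, 2, "LON", None),
    (8, 2, None, "blue"), (9, 2, "LON", None), (10, 2, None, None),
])

# Read each id in reverse row order (the "cursor" of the answer).
rows = conn.execute(
    "SELECT row_id, id, col_a, col_b FROM t ORDER BY id, row_id DESC"
).fetchall()

current_id = None
last_a = last_b = None
for row_id, id_, col_a, col_b in rows:
    if id_ != current_id:
        # New partition: forget the values remembered for the previous id.
        current_id, last_a, last_b = id_, None, None
    if col_a is not None:
        last_a = col_a
    if col_b is not None:
        last_b = col_b
    needs_a = col_a is None and last_a is not None
    needs_b = col_b is None and last_b is not None
    if needs_a or needs_b:
        # Non-null columns are rewritten with their own value, so one
        # statement covers every column.
        conn.execute(
            "UPDATE t SET col_a = ?, col_b = ? WHERE row_id = ?",
            (last_a, last_b, row_id),
        )
conn.commit()

result = conn.execute("SELECT row_id, col_a, col_b FROM t ORDER BY row_id").fetchall()
for row in result:
    print(row)
```

Adding a column means one more variable and one more SET clause rather than another join. Nulls with no later value in the same id (col_b rows 9 and 10, col_a row 10) are left untouched by this sketch.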

This may take a while, and the numerous updates will take a lot of locks. It looks like your data can be processed in independent chunks, using the "id" column as the chunk identifier, so it would be a good idea to do that.
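One way to structure the chunking is a short transaction per id, roughly as below. This is only a sketch of the control flow in Python with sqlite3: SQLite locks the whole database rather than individual rows, so it does not reproduce PostgreSQL's locking behaviour. The helper name `backfill_partition`, the table `t`, and the column `row_id` are all hypothetical, and only col_b is handled to keep it short.

```python
import sqlite3


def backfill_partition(conn, partition_id):
    """Back-fill col_b for a single id inside its own transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        rows = conn.execute(
            "SELECT row_id, col_b FROM t WHERE id = ? ORDER BY row_id DESC",
            (partition_id,),
        ).fetchall()
        last_b = None
        updates = []
        for row_id, col_b in rows:
            if col_b is not None:
                last_b = col_b
            elif last_b is not None:
                updates.append((last_b, row_id))
        # Apply all changes for this id in one batch.
        conn.executemany("UPDATE t SET col_b = ? WHERE row_id = ?", updates)
    return len(updates)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (row_id INTEGER PRIMARY KEY, id INTEGER, col_b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    (0, 1, "red"), (1, 1, "red"), (2, 1, None), (3, 1, None), (4, 1, "blue"),
    (5, 2, "red"), (6, 2, None), (7, 2, None), (8, 2, "blue"),
    (9, 2, None), (10, 2, None),
])
conn.commit()

ids = [r[0] for r in conn.execute("SELECT DISTINCT id FROM t ORDER BY id")]
updated = {i: backfill_partition(conn, i) for i in ids}
col_b_after = [r[0] for r in conn.execute("SELECT col_b FROM t ORDER BY row_id")]
print(updated, col_b_after)
```

Because no id depends on another, a failure in one chunk only rolls back that chunk, and the chunks could also be handed to separate workers.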

Answer 2

Can you try this query?

WITH samp AS (
  SELECT 0 row_id, 1 id, null col_a, 'red' col_b UNION ALL
  SELECT 1, 1, 'NYC', 'red' UNION ALL
  SELECT 2, 1, 'SF', NULL UNION ALL
  SELECT 3, 1, NULL, NULL UNION ALL
  SELECT 4, 1, 'SF', 'blue' UNION ALL
  SELECT 5, 2, 'PAR', 'red' UNION ALL
  SELECT 6, 2, 'LON', NULL UNION ALL
  SELECT 7, 2, 'LON', NULL UNION ALL
  SELECT 8, 2, NULL, 'blue' UNION ALL
  SELECT 9, 2, 'LON', NULL UNION ALL
  SELECT 10, 2, NULL, NULL
)
SELECT
  row_id,
  id,
  -- Next non-null col_a within the same id; if nothing follows,
  -- fall back to the closest previous non-null value.
  IFNULL(
    FIRST_VALUE(col_a IGNORE NULLS)
      OVER (PARTITION BY id ORDER BY row_id
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
    FIRST_VALUE(col_a IGNORE NULLS)
      OVER (PARTITION BY id ORDER BY row_id DESC
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_a,
  -- Same logic for col_b.
  IFNULL(
    FIRST_VALUE(col_b IGNORE NULLS)
      OVER (PARTITION BY id ORDER BY row_id
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
    FIRST_VALUE(col_b IGNORE NULLS)
      OVER (PARTITION BY id ORDER BY row_id DESC
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_b
FROM samp
ORDER BY id, row_id

References:
https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value
https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls
