简体   繁体   中英

Get the next (or previous) non-null value in multiple partitioned

Sample data below.

I want to clean up data based on the next non-null value of the same id, based on row (actually a timestamp).

  • I can't do lag, because in some cases there are consecutive nulls.
  • I can't do coalesce(a.col_a, (select min(b.col_a) from table b where a.id=b.id)) because it will return an "outdated" value (eg NYC instead of SF in col_a row 4). (I can do this, once I've accounted for everything else, for the cases where i have no next non-null value, like col_b row 9/10, to just fill in the last).

The only thing I can think of is to do

table_x as (select id, col_x from table where col_a is not null)

for each column, and then join taking the minimum where id = id and table_x.row > table.row. But I have a handful of columns and that feels cumbersome and inefficient.

Appreciate any help!

row id col_a col_a_desired col_b col_b_desired
0 1 - NYC red red
1 1 NYC NYC red red
2 1 SF SF - blue
3 1 - SF - blue
4 1 SF SF blue blue
5 2 PAR PAR red red
6 2 LON LON - blue
7 2 LON LON - blue
8 2 - LON blue blue
9 2 LON LON - blue
10 2 - LON - blue

I want to clean up data based on the next non-null value.

So if you reverse the order, that's the last non-null value.

If you have multiple columns and the logic is too cumbersome to write in SQL, you can write it in plpgsql instead, or even use the script language of your choice (but that will be slower).

The idea is to open a cursor for update, with an ORDER BY in the reverse order mentioned in the question. Then the plpgsql code stores the last non-null values in variables, and if needed issues an UPDATE WHERE CURRENT OF cursor to replace the nulls in the table with desired values.

This may take a while, and the numerous updates will take a lot of locks. It looks like your data can be processed in independent chunks using the "id" column as chunk identifier, so it would be a good idea to use that.

Can you try this query?

WITH samp AS (
  SELECT 0 row_id, 1 id, null col_a, 'red' col_b UNION ALL
  SELECT 1, 1, 'NYC', 'red' UNION ALL
  SELECT 2, 1, 'SF', NULL UNION ALL
  SELECT 3, 1, NULL, NULL UNION ALL
  SELECT 4, 1, 'SF', 'blue' UNION ALL
  SELECT 5, 2, 'PAR', 'red' UNION ALL
  SELECT 6, 2, 'LON', NULL UNION ALL
  SELECT 7, 2, 'LON', NULL UNION ALL
  SELECT 8, 2, NULL, 'blue' UNION ALL
  SELECT 9, 2, 'LON', NULL UNION ALL
  SELECT 10, 2, NULL, NULL
)
  SELECT
  row_id,
  id,
  IFNULL(FIRST_VALUE(col_a IGNORE NULLS) 
    OVER (PARTITION BY id ORDER BY row_id
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
  FIRST_VALUE(col_a IGNORE NULLS) 
    OVER (PARTITION BY id ORDER BY row_id desc
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_a,
  IFNULL(FIRST_VALUE(col_b IGNORE NULLS) 
    OVER (PARTITION BY id ORDER BY row_id
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
  FIRST_VALUE(col_b IGNORE NULLS) 
    OVER (PARTITION BY id ORDER BY row_id desc
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_b
 from samp order by id, row_id

Output: 在此处输入图像描述

References: https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM