
Get the next (or previous) non-null value in multiple partitioned

Sample data below.

I want to clean up data based on the next non-null value of the same id, ordered by row (actually a timestamp).

  • I can't use lag, because in some cases there are consecutive nulls.
  • I can't do coalesce(a.col_a, (select min(b.col_a) from table b where a.id=b.id)) because it will return an "outdated" value (e.g. NYC instead of SF in col_a row 4). (I can do this, once I've accounted for everything else, for the cases where I have no next non-null value, like col_b rows 9/10, to just fill in the last.)

The only thing I can think of is to do

table_x as (select id, row, col_x from table where col_x is not null)

for each column, and then join taking the minimum where id = id and table_x.row > table.row. But I have a handful of columns and that feels cumbersome and inefficient.
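The per-column lookup described above can also be phrased as a correlated subquery instead of one CTE and join per column. The following is only a sketch, run through Python's built-in sqlite3 module so that it is self-contained. The table name `samp` and the column name `row_id` are assumptions (the question calls the ordering column `row`), and only `col_a` is handled.

```python
import sqlite3

# Sample rows from the question: (row_id, id, col_a, col_b); None stands for NULL.
ROWS = [
    (0, 1, None, "red"), (1, 1, "NYC", "red"), (2, 1, "SF", None),
    (3, 1, None, None), (4, 1, "SF", "blue"), (5, 2, "PAR", "red"),
    (6, 2, "LON", None), (7, 2, "LON", None), (8, 2, None, "blue"),
    (9, 2, "LON", None), (10, 2, None, None),
]

# Keep col_a when it is present; otherwise take col_a from the nearest
# later row of the same id that has a non-null value.
QUERY = """
SELECT a.row_id,
       COALESCE(a.col_a,
                (SELECT b.col_a
                   FROM samp AS b
                  WHERE b.id = a.id
                    AND b.row_id > a.row_id
                    AND b.col_a IS NOT NULL
                  ORDER BY b.row_id
                  LIMIT 1)) AS col_a
  FROM samp AS a
 ORDER BY a.row_id
"""


def next_non_null_col_a(rows):
    """Load rows into an in-memory SQLite table and run the lookup."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samp (row_id INTEGER, id INTEGER, col_a TEXT, col_b TEXT)"
    )
    conn.executemany("INSERT INTO samp VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result
```

Rows with no later non-null value (row 10 here) stay NULL and still need the separate fallback mentioned in the question. Each column needs its own subquery, so this does not remove the repetition the question is worried about.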

Appreciate any help!

row  id  col_a  col_a_desired  col_b  col_b_desired
0    1   -      NYC            red    red
1    1   NYC    NYC            red    red
2    1   SF     SF             -      blue
3    1   -      SF             -      blue
4    1   SF     SF             blue   blue
5    2   PAR    PAR            red    red
6    2   LON    LON            -      blue
7    2   LON    LON            -      blue
8    2   -      LON            blue   blue
9    2   LON    LON            -      blue
10   2   -      LON            -      blue

I want to clean up data based on the next non-null value.

So if you reverse the order, that's the last non-null value.

If you have multiple columns and the logic is too cumbersome to write in SQL, you can write it in plpgsql instead, or even use the scripting language of your choice (but that will be slower).

The idea is to open a cursor for update, with an ORDER BY in the reverse of the order mentioned in the question. The plpgsql code then stores the last non-null values in variables and, where needed, issues an UPDATE ... WHERE CURRENT OF cursor to replace the nulls in the table with the desired values.

This may take a while, and the numerous updates will take a lot of locks. It looks like your data can be processed in independent chunks using the "id" column as chunk identifier, so it would be a good idea to use that.
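The reverse-order scan can be illustrated outside the database. This is a minimal Python sketch of the logic only, not plpgsql and not a cursor: the function name `backfill_next_non_null` and the dict-based row layout are assumptions made for the example, and the point where a real cursor loop would issue the UPDATE is marked in a comment.

```python
def backfill_next_non_null(rows, columns):
    """Fill None values with the next non-null value within the same id.

    rows: list of dicts with keys "row_id", "id" and the given columns.
    Returns new dicts sorted by (id, row_id); the input is not modified.
    """
    last_seen = {}      # column -> most recent non-null value in the reverse scan
    current_id = None
    filled = []
    # Visit rows chunk by chunk (id), and within a chunk in reverse row order.
    for row in sorted(rows, key=lambda r: (r["id"], -r["row_id"])):
        if row["id"] != current_id:
            # New chunk: values must not leak from one id into another.
            current_id = row["id"]
            last_seen = {}
        new_row = dict(row)
        for col in columns:
            if row[col] is not None:
                last_seen[col] = row[col]
            elif col in last_seen:
                # A cursor loop would run UPDATE ... WHERE CURRENT OF here.
                new_row[col] = last_seen[col]
        filled.append(new_row)
    return sorted(filled, key=lambda r: (r["id"], r["row_id"]))
```

Because the remembered values are reset whenever the id changes, each id can be processed on its own, which is what makes the chunking suggested above possible. Trailing nulls with no later value (col_b in rows 9 and 10 of the sample) are left untouched by this pass.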

Can you try this query?

WITH samp AS (
  SELECT 0 AS row_id, 1 AS id, NULL AS col_a, 'red' AS col_b UNION ALL
  SELECT 1, 1, 'NYC', 'red' UNION ALL
  SELECT 2, 1, 'SF', NULL UNION ALL
  SELECT 3, 1, NULL, NULL UNION ALL
  SELECT 4, 1, 'SF', 'blue' UNION ALL
  SELECT 5, 2, 'PAR', 'red' UNION ALL
  SELECT 6, 2, 'LON', NULL UNION ALL
  SELECT 7, 2, 'LON', NULL UNION ALL
  SELECT 8, 2, NULL, 'blue' UNION ALL
  SELECT 9, 2, 'LON', NULL UNION ALL
  SELECT 10, 2, NULL, NULL
)
SELECT
  row_id,
  id,
  -- First argument: next non-null value (current row onward).
  -- Second argument: fallback to the previous non-null value (DESC order).
  IFNULL(
    FIRST_VALUE(col_a IGNORE NULLS)
      OVER (PARTITION BY id ORDER BY row_id
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
    FIRST_VALUE(col_a IGNORE NULLS)
      OVER (PARTITION BY id ORDER BY row_id DESC
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_a,
  IFNULL(
    FIRST_VALUE(col_b IGNORE NULLS)
      OVER (PARTITION BY id ORDER BY row_id
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
    FIRST_VALUE(col_b IGNORE NULLS)
      OVER (PARTITION BY id ORDER BY row_id DESC
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_b
FROM samp
ORDER BY id, row_id

Output: [image]

References:
https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value
https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls
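To make the two window frames in the query easier to follow, here is a small plain-Python model of the IFNULL(FIRST_VALUE ..., FIRST_VALUE ...) expression for a single partition. It is a sketch for reasoning about the result, not a replacement for the query; the helper names `first_non_null` and `fill_column` are made up for this example, and the input list is assumed to be already ordered by row_id.

```python
def first_non_null(values):
    """Return the first value that is not None, or None if there is none."""
    return next((v for v in values if v is not None), None)


def fill_column(values):
    """Model the query's per-column expression for one partition.

    values: the column's values for one id, ordered by row_id.
    """
    filled = []
    for i in range(len(values)):
        # Frame 1 (ORDER BY row_id): current row through the end of the partition.
        forward = first_non_null(values[i:])
        # Frame 2 (ORDER BY row_id DESC): current row back to the start.
        backward = first_non_null(reversed(values[: i + 1]))
        # IFNULL: prefer the next non-null value, else the previous one.
        filled.append(forward if forward is not None else backward)
    return filled
```

Running `fill_column` on each id's col_a and col_b values from the sample data gives the col_a_desired and col_b_desired columns from the question, including the trailing rows that have no later non-null value.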
