Obtain latest NOT NULL values for different columns in a table, grouped by common column

In a PostgreSQL database, I have a table of measurements that looks as follows:

| sensor_group_id | ts                        | value_1 | value_2 | etc... |
|-----------------|---------------------------|---------|---------|--------|
| 1               | 2021-07-21T00:20:00+00:00 | 15      | NULL    |        |
| 1               | 2021-07-15T00:20:00+00:00 | NULL    | 23      |        |
| 2               | 2021-07-17T00:20:00+00:00 | NULL    | 11      |        |
| 1               | 2021-07-13T00:20:00+00:00 | 9       | 4       |        |
| 2               | 2021-07-10T00:20:00+00:00 | 99      | 36      |        |

There are many columns with different types of measurements in this table. Each Sensor Group produces measurements of different types at the same time, but not always all types. So we end up with partly filled rows.
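
To experiment with the queries in this post, the sample rows above can be loaded into a table like the following. This is only a minimal sketch: the column types are assumptions (the post does not state them), and only value_1 and value_2 are included.

```sql
-- Assumed column types; the original post does not specify them.
CREATE TABLE measurements (
    sensor_group_id integer     NOT NULL,
    ts              timestamptz NOT NULL,
    value_1         integer,
    value_2         integer
);

-- The five sample rows from the table above.
INSERT INTO measurements (sensor_group_id, ts, value_1, value_2) VALUES
    (1, '2021-07-21T00:20:00+00:00', 15,   NULL),
    (1, '2021-07-15T00:20:00+00:00', NULL, 23),
    (2, '2021-07-17T00:20:00+00:00', NULL, 11),
    (1, '2021-07-13T00:20:00+00:00', 9,    4),
    (2, '2021-07-10T00:20:00+00:00', 99,   36);
```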

What I want to do:

  • For each different sensor_group_id
  • For each different column (measurement type)
  • Obtain the latest timestamp when that column was NOT NULL and the value for that measurement at that timestamp

The solution I have now seems pretty cumbersome:

WITH
    latest_value_1 AS (SELECT DISTINCT ON (sensor_group_id) sensor_group_id, ts, value_1
                                  FROM measurements
                                  WHERE value_1 IS NOT NULL
                                  ORDER BY sensor_group_id, ts DESC),
    latest_value_2 AS (SELECT DISTINCT ON (sensor_group_id) sensor_group_id, ts, value_2
                                  FROM measurements
                                  WHERE value_2 IS NOT NULL
                                  ORDER BY sensor_group_id, ts DESC),
    latest_value_3 AS (SELECT DISTINCT ON (sensor_group_id) sensor_group_id, ts, value_3
                                  FROM measurements
                                  WHERE value_3 IS NOT NULL
                                  ORDER BY sensor_group_id, ts DESC),
etc...
SELECT latest_value_1.sensor_group_id,
       latest_value_1.ts        AS latest_value_1_ts,
       value_1,
       latest_value_2.ts        AS latest_value_2_ts,
       value_2,
       latest_value_3.ts        AS latest_value_3_ts,
       value_3,
       etc...
FROM latest_value_1
         JOIN latest_value_2
              ON latest_value_1.sensor_group_id = latest_value_2.sensor_group_id
         JOIN latest_value_3
              ON latest_value_1.sensor_group_id = latest_value_3.sensor_group_id
        etc...

This produces the following result:

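| sensor_group_id | latest_value_1_ts         | value_1 | latest_value_2_ts         | value_2 | etc... |
|-----------------|---------------------------|---------|---------------------------|---------|--------|
| 1               | 2021-07-21T00:20:00+00:00 | 15      | 2021-07-15T00:20:00+00:00 | 23      |        |
| 2               | 2021-07-10T00:20:00+00:00 | 99      | 2021-07-17T00:20:00+00:00 | 11      |        |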

This seems outrageously complicated, but I'm not sure if there is a better approach. Help would be much appreciated!

Not sure whether this is simpler...

with
  sensor_groups(sgr_id) as ( -- Change it to the list of groups if you have it
    select distinct sensor_group_id from measurements)
select
  *
from
  sensor_groups as sg
    left join lateral (
      select ts, value_1
      from measurements
      where value_1 is not null and sensor_group_id = sg.sgr_id
      order by ts desc limit 1) as v1(ts_1, v_1) on true
    left join lateral (
      select ts, value_2
      from measurements
      where value_2 is not null and sensor_group_id = sg.sgr_id
      order by ts desc limit 1) as v2(ts_2, v_2) on true
    ...

PS: Data normalization could help a lot
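
To illustrate the normalization remark, here is a rough sketch, not a definitive design. It assumes the wide table is reshaped into one row per sensor group, timestamp and measurement type; the names normalized, measurement_type and value are made up for this example, and the inline VALUES list only stands in for the real measurements table (types assumed as integer and timestamptz).

```sql
-- Inline sample data standing in for the real table (drop this CTE to query the real one).
WITH measurements (sensor_group_id, ts, value_1, value_2) AS (
    VALUES
        (1, '2021-07-21T00:20:00+00:00'::timestamptz, 15,   NULL::integer),
        (1, '2021-07-15T00:20:00+00:00',              NULL, 23),
        (2, '2021-07-17T00:20:00+00:00',              NULL, 11),
        (1, '2021-07-13T00:20:00+00:00',              9,    4),
        (2, '2021-07-10T00:20:00+00:00',              99,   36)
),
-- Hypothetical normalized layout: one row per non-NULL measurement.
normalized AS (
    SELECT m.sensor_group_id, m.ts, v.measurement_type, v.value
    FROM measurements AS m
    CROSS JOIN LATERAL (VALUES ('value_1', m.value_1),
                               ('value_2', m.value_2)) AS v (measurement_type, value)
    WHERE v.value IS NOT NULL
)
-- With that layout, a single DISTINCT ON covers every measurement type.
SELECT DISTINCT ON (sensor_group_id, measurement_type)
       sensor_group_id, measurement_type, ts, value
FROM normalized
ORDER BY sensor_group_id, measurement_type, ts DESC;
```

If the data were stored in this shape to begin with, the unpivot step would disappear and adding a new measurement type would not require touching the query. The result comes back as one row per group and measurement type rather than one wide row per group.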

What you really want is the IGNORE NULLS option on LAG() or LAST_VALUE(), but Postgres does not support it. Instead, you can use a two-level trick: first assign a group number to each row, so that each NULL value falls in the same group as the most recent preceding row with a value, then "schmear" that value across the group:

select t.*,
       max(value_1) over (partition by sensor_group_id, grp_1) as imputed_value_1,
       max(value_2) over (partition by sensor_group_id, grp_2) as imputed_value_2,
       max(value_3) over (partition by sensor_group_id, grp_3) as imputed_value_3
from (select t.*,
             count(value_1) over (partition by sensor_group_id order by ts) as grp_1,
             count(value_2) over (partition by sensor_group_id order by ts) as grp_2,
             count(value_3) over (partition by sensor_group_id order by ts) as grp_3
      from measurements as t  -- t aliases the table from the question
     ) t;
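
The query above fills values forward on every row. To get from there to the one-row-per-group result asked for in the question, something like the following might work. It is a sketch under assumptions rather than part of the original answer: the CTE names grouped and filled and the *_ts output columns are made up here, and the inline VALUES list only stands in for the real table (types assumed as integer and timestamptz).

```sql
-- Inline sample data standing in for the real table (drop this CTE to query the real one).
WITH measurements (sensor_group_id, ts, value_1, value_2) AS (
    VALUES
        (1, '2021-07-21T00:20:00+00:00'::timestamptz, 15,   NULL::integer),
        (1, '2021-07-15T00:20:00+00:00',              NULL, 23),
        (2, '2021-07-17T00:20:00+00:00',              NULL, 11),
        (1, '2021-07-13T00:20:00+00:00',              9,    4),
        (2, '2021-07-10T00:20:00+00:00',              99,   36)
),
-- Step 1: running count of non-NULL values; NULL rows keep the previous count.
grouped AS (
    SELECT m.*,
           count(value_1) OVER w AS grp_1,
           count(value_2) OVER w AS grp_2
    FROM measurements AS m
    WINDOW w AS (PARTITION BY sensor_group_id ORDER BY ts)
),
-- Step 2: spread each group's single non-NULL value, and its timestamp, over the group.
filled AS (
    SELECT sensor_group_id, ts,
           max(value_1) OVER (PARTITION BY sensor_group_id, grp_1) AS imputed_value_1,
           min(ts) FILTER (WHERE value_1 IS NOT NULL)
                   OVER (PARTITION BY sensor_group_id, grp_1)      AS value_1_ts,
           max(value_2) OVER (PARTITION BY sensor_group_id, grp_2) AS imputed_value_2,
           min(ts) FILTER (WHERE value_2 IS NOT NULL)
                   OVER (PARTITION BY sensor_group_id, grp_2)      AS value_2_ts
    FROM grouped
)
-- Step 3: keep only the most recent row per sensor group.
SELECT DISTINCT ON (sensor_group_id)
       sensor_group_id,
       value_1_ts      AS latest_value_1_ts,
       imputed_value_1 AS value_1,
       value_2_ts      AS latest_value_2_ts,
       imputed_value_2 AS value_2
FROM filled
ORDER BY sensor_group_id, ts DESC;
```

Each (sensor_group_id, grp_n) partition contains at most one non-NULL value, so max() simply picks it up and the filtered min(ts) returns the timestamp of that row. Rows that come before the first non-NULL value of a column land in group 0 and yield NULL for both.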
