
Get the last value of partition group in Hive query, but with additional requirements

Say I've got 3 columns in a table: id, flag, time. Flag can only be one of the three: A1, A2, B.

ID  flag    time
1   A1  2016-01-01
1   A2  2016-01-02
1   B   2016-01-03
1   B   2016-01-04
2   A1  2016-01-02
2   B   2016-01-03
2   A2  2016-01-04
2   B   2016-01-05

The data is sorted by time within each ID. For each ID, whenever the flag equals B, I'd like to get the most recent preceding row whose flag is not B (its flag and time), e.g.:

1   B   2016-01-03  A2  2016-01-02
1   B   2016-01-04  A2  2016-01-02
2   B   2016-01-03  A1  2016-01-02
2   B   2016-01-05  A2  2016-01-04

Is this even possible in a Hive query?

Use the MAX window function to compute, for each row, the running maximum time over the non-B rows of the same id. Then join that result back to the original table to pick up the flag recorded at that time, which is the last non-B flag before each B row.

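SELECT X.*,
       T.FLAG AS LAST_NON_B_FLAG
FROM
 (SELECT T.*,
         -- Running maximum of TIME over the non-B rows of this ID, up to the current row
         MAX(CASE WHEN FLAG<>'B' THEN TIME END) OVER(PARTITION BY ID ORDER BY TIME) AS MAX_TIME_BEFORE_B
  FROM T
 ) X
-- Join back to the table to fetch the flag stored at that time
JOIN T ON T.ID=X.ID AND T.TIME=X.MAX_TIME_BEFORE_B
WHERE X.FLAG='B'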
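
To make the logic of the query above easier to check by hand, here is a minimal Python sketch of the same idea run against the sample data from the question. It is not Hive code: the names SAMPLE_ROWS and last_non_b are made up for illustration, and it assumes time values are ISO dates that sort correctly as strings. Like the inner join in the query, it drops any B row that has no earlier non-B row.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (id, flag, time). Hypothetical variable name.
SAMPLE_ROWS = [
    (1, "A1", "2016-01-01"),
    (1, "A2", "2016-01-02"),
    (1, "B", "2016-01-03"),
    (1, "B", "2016-01-04"),
    (2, "A1", "2016-01-02"),
    (2, "B", "2016-01-03"),
    (2, "A2", "2016-01-04"),
    (2, "B", "2016-01-05"),
]


def last_non_b(rows):
    """Pair every B row with the latest earlier non-B row of the same id."""
    result = []
    # Equivalent of PARTITION BY id ORDER BY time.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for _, partition in groupby(ordered, key=itemgetter(0)):
        latest = None  # running "max time" non-B row within this partition
        for row_id, flag, time in partition:
            if flag != "B":
                latest = (flag, time)
            elif latest is not None:
                # Equivalent of the join plus WHERE flag = 'B'.
                result.append((row_id, flag, time) + latest)
    return result


for row in last_non_b(SAMPLE_ROWS):
    print(row)
```

Because each partition is scanned in time order, the most recently seen non-B row is also the one with the maximum non-B time so far, which is what the window function computes.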


select  id
       ,flag
       ,time
       ,A.flag as A_flag
       ,A.time as A_time

from   (select  id
               ,flag
               ,time

               ,max
                (
                    case 
                        when flag <> 'B' 
                        then named_struct ('time',time,'flag',flag) 
                    end
                ) over
                (   
                    partition by    id 
                    order by        time 
                    rows            unbounded preceding
                )  as A

        from    t
        ) t

where   flag = 'B'
;

+----+------+------------+--------+------------+
| id | flag |    time    | a_flag |   a_time   |
+----+------+------------+--------+------------+
|  1 | B    | 2016-01-03 | A2     | 2016-01-02 |
|  1 | B    | 2016-01-04 | A2     | 2016-01-02 |
|  2 | B    | 2016-01-03 | A1     | 2016-01-02 |
|  2 | B    | 2016-01-05 | A2     | 2016-01-04 |
+----+------+------------+--------+------------+
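
The query above avoids the join by taking the running MAX of a struct rather than of the time alone. This appears to rely on structs being compared field by field, with time placed first, so the winning struct carries its flag along with it. Python tuples compare the same way, so the following sketch imitates the idea for a single id; the function name running_max_non_b is hypothetical and the code only illustrates the comparison, not Hive's implementation.

```python
def running_max_non_b(partition):
    """Return, for each row, the largest (time, flag) pair among the
    non-B rows seen so far, or None if there is none yet.

    `partition` is a list of (flag, time) rows for one id, ordered by time.
    """
    best = None
    out = []
    for flag, time in partition:
        if flag != "B":
            candidate = (time, flag)  # time first, so it decides the comparison
            if best is None or candidate > best:
                best = candidate
        out.append(best)
    return out


# Rows for id 1 from the question.
rows_id1 = [
    ("A1", "2016-01-01"),
    ("A2", "2016-01-02"),
    ("B", "2016-01-03"),
    ("B", "2016-01-04"),
]
print(running_max_non_b(rows_id1))
```

For both B rows the running maximum is the pair holding 2016-01-02 and A2, which matches the a_time and a_flag columns in the result table.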

P.S.

  • I recommend not using a word that might be reserved ( time ) as a column name.
  • I recommend not using an undescriptive name such as time for a date column.
