Get the last value of partition group in Hive query, but with additional requirements

Say I've got 3 columns in a table: id, flag, time. Flag can only be one of three values: A1, A2, or B.

ID  flag    time
1   A1  2016-01-01
1   A2  2016-01-02
1   B   2016-01-03
1   B   2016-01-04
2   A1  2016-01-02
2   B   2016-01-03
2   A2  2016-01-04
2   B   2016-01-05

The data has been sorted by time for each ID. Now I'd like to get, for each ID and for every row where the flag equals B, the last preceding non-B flag and its time, e.g.:

1   B   2016-01-03  A2  2016-01-02
1   B   2016-01-04  A2  2016-01-02
2   B   2016-01-03  A1  2016-01-02
2   B   2016-01-05  A2  2016-01-04

Is this even possible in a Hive query?

Use the max window function to get the running maximum time over the non-B flags. Then join this result back to the original table to get the flag for that maximum time (the last non-B row before each B row for a given id).

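SELECT X.*,
       T.FLAG
FROM
 (SELECT T.*,
  -- running maximum of TIME over the non-B rows seen so far for this ID
  MAX(CASE WHEN FLAG<>'B' THEN TIME END) OVER(PARTITION BY ID ORDER BY TIME) AS MAX_TIME_BEFORE_B
  FROM T
 ) X
-- join back to the original table to pick up the flag stored at that time
JOIN T ON T.ID=X.ID AND T.TIME=X.MAX_TIME_BEFORE_B
-- keep only the B rows
WHERE X.FLAG='B'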
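
To check the join approach against the sample data without a Hive cluster, here is a minimal sketch that runs essentially the same query through Python's built-in sqlite3 module. This is an assumption-laden stand-in, not Hive itself: it assumes an SQLite build with window-function support (3.25 or later), stores time as ISO-formatted text so that string comparison matches date order, and uses a hypothetical helper name, last_non_b_via_join.

```python
import sqlite3

# Sample rows from the question: (id, flag, time)
rows = [
    (1, "A1", "2016-01-01"),
    (1, "A2", "2016-01-02"),
    (1, "B", "2016-01-03"),
    (1, "B", "2016-01-04"),
    (2, "A1", "2016-01-02"),
    (2, "B", "2016-01-03"),
    (2, "A2", "2016-01-04"),
    (2, "B", "2016-01-05"),
]


def last_non_b_via_join(rows):
    """Run the window-function-plus-join query in SQLite (a stand-in for Hive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, flag TEXT, time TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    query = """
        SELECT x.id, x.flag, x.time, t.flag, t.time
        FROM (
            SELECT t.*,
                   -- running maximum time over the non-B rows of each id
                   MAX(CASE WHEN flag <> 'B' THEN time END)
                       OVER (PARTITION BY id ORDER BY time) AS max_time_before_b
            FROM t
        ) x
        JOIN t ON t.id = x.id AND t.time = x.max_time_before_b
        WHERE x.flag = 'B'
        ORDER BY x.id, x.time
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in last_non_b_via_join(rows):
    print(row)
```

Note that the join matches on (id, time), so it relies on each id having at most one row per time value; duplicate times within an id would multiply the output rows. Because it is an inner join, B rows with no earlier non-B row are dropped.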


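-- Alternative without a join: carry the flag along with the time in a struct.
select  id
       ,flag
       ,time
       ,A.flag as A_flag
       ,A.time as A_time

from   (select  id
               ,flag
               ,time

                -- running max of (time, flag) over the non-B rows of each id;
                -- time is the first struct field, so the latest non-B row wins
               ,max
                (
                    case 
                        when flag <> 'B' 
                        then named_struct ('time',time,'flag',flag) 
                    end
                ) over
                (   
                    partition by    id 
                    order by        time 
                    rows            unbounded preceding
                )  as A

        from    t
        ) t

-- keep only the B rows
where   flag = 'B'
;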

+----+------+------------+--------+------------+
| id | flag |    time    | a_flag |   a_time   |
+----+------+------------+--------+------------+
|  1 | B    | 2016-01-03 | A2     | 2016-01-02 |
|  1 | B    | 2016-01-04 | A2     | 2016-01-02 |
|  2 | B    | 2016-01-03 | A1     | 2016-01-02 |
|  2 | B    | 2016-01-05 | A2     | 2016-01-04 |
+----+------+------------+--------+------------+
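
The struct-based query works because, as far as I understand Hive's behavior, structs are compared field by field, so putting time first makes the maximum struct the one from the latest non-B row. The following Python sketch emulates that running maximum with tuples, which compare the same way. It is an illustration of the idea, not Hive code; the function name last_non_b_running is hypothetical.

```python
# Sample rows from the question: (id, flag, time)
sample = [
    (1, "A1", "2016-01-01"),
    (1, "A2", "2016-01-02"),
    (1, "B", "2016-01-03"),
    (1, "B", "2016-01-04"),
    (2, "A1", "2016-01-02"),
    (2, "B", "2016-01-03"),
    (2, "A2", "2016-01-04"),
    (2, "B", "2016-01-05"),
]


def last_non_b_running(rows):
    """Emulate max(struct(time, flag)) over (partition by id order by time)."""
    result = []
    latest = {}  # id -> (time, flag) of the greatest non-B row seen so far
    # Sort by id, then time, mirroring PARTITION BY id ORDER BY time.
    for id_, flag, time in sorted(rows, key=lambda r: (r[0], r[2])):
        if flag != "B":
            candidate = (time, flag)  # tuples compare field by field, time first
            if id_ not in latest or candidate > latest[id_]:
                latest[id_] = candidate
        else:
            # A B row with no earlier non-B row gets None for both columns,
            # which corresponds to NULL in the query.
            a_time, a_flag = latest.get(id_, (None, None))
            result.append((id_, flag, time, a_flag, a_time))
    return result


for row in last_non_b_running(sample):
    print(row)
```

Unlike the join-based approach, this version keeps B rows that have no preceding non-B row and needs only a single pass over each partition.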

P.S.

  • I would recommend not using what might be a reserved word (time) as a column name.
  • I would recommend not using nondescriptive names, such as time for a date column.
