Say I've got 3 columns in a table: id, flag, time. Flag can only be one of the three: A1, A2, B.
ID flag time
1 A1 2016-01-01
1 A2 2016-01-02
1 B 2016-01-03
1 B 2016-01-04
2 A1 2016-01-02
2 B 2016-01-03
2 A2 2016-01-04
2 B 2016-01-05
The data has been sorted by time for each ID. Now I'd like to get, for each ID, when the flag equals B, the last non-B flag, eg:
1 B 2016-01-03 A2 2016-01-02
1 B 2016-01-04 A2 2016-01-02
2 B 2016-01-03 A1 2016-01-02
2 B 2016-01-05 A2 2016-01-04
Is this even possible in a Hive query?
Use max
window function to get the running maximum time for non B flags. Then join
this result to the original table to get the flag information for the corresponding max time (before flag B for a given id).
SELECT X.*,
T.FLAG
FROM
(SELECT T.*,
MAX(CASE WHEN FLAG<>'B' THEN TIME END) OVER(PARTITION BY ID ORDER BY TIME) AS MAX_TIME_BEFORE_B
FROM T
) X
JOIN T ON T.ID=X.ID AND T.TIME=X.MAX_TIME_BEFORE_B
WHERE X.FLAG='B'
select id
,flag
,time
,A.flag as A_flag
,A.time as A_time
from (select id
,flag
,time
,max
(
case
when flag <> 'B'
then named_struct ('time',time,'flag',flag)
end
) over
(
partition by id
order by time
rows unbounded preceding
) as A
from t
) t
where flag = 'B'
;
+----+------+------------+--------+------------+
| id | flag | time | a_flag | a_time |
+----+------+------------+--------+------------+
| 1 | B | 2016-01-03 | A2 | 2016-01-02 |
| 1 | B | 2016-01-04 | A2 | 2016-01-02 |
| 2 | B | 2016-01-03 | A1 | 2016-01-02 |
| 2 | B | 2016-01-05 | A2 | 2016-01-04 |
+----+------+------------+--------+------------+
Ps
time
) as column name.time
for date column.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.