How to exclude records that create a cycle in data with the same start_dt?
To identify a cycle I can do
select * from input A join input B on A.prv=B.cur and A.cur=B.prv
but how do I keep only one record from each cycle that has the same start_dt? Apart from prv and cur, all columns are the same for these records. I am using Spark SQL/Hive.
Input
prv cur start_dt
A B 2099-12-31
B A 2099-12-31
P Q 2018-12-31
Q P 2018-12-31
Output (any one record from each cycle)
prv cur start_dt
A B 2099-12-31
P Q 2018-12-31
If you don't have records where prv = cur (such as A, A, 2099-12-31), then you could use:
SELECT A.* FROM input A
JOIN input B ON A.prv=B.cur AND A.cur=B.prv
WHERE A.prv > B.prv
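To see why the prv = cur caveat matters, the join can be tried on the sample rows plus one self-pair. This is only a sketch: it uses Python's built-in sqlite3 as a stand-in for Spark SQL/Hive, selects only the columns of A, and the `A, A, 2099-12-31` row is a hypothetical addition that is not in the original input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input (prv TEXT, cur TEXT, start_dt TEXT)")
conn.executemany(
    "INSERT INTO input VALUES (?, ?, ?)",
    [
        ("A", "B", "2099-12-31"),
        ("B", "A", "2099-12-31"),
        ("P", "Q", "2018-12-31"),
        ("Q", "P", "2018-12-31"),
        ("A", "A", "2099-12-31"),  # hypothetical self-pair, not in the original input
    ],
)

# Same join as above; the strict inequality keeps one side of each two-row cycle.
join_rows = conn.execute(
    """
    SELECT A.prv, A.cur, A.start_dt
    FROM input A
    JOIN input B ON A.prv = B.cur AND A.cur = B.prv
    WHERE A.prv > B.prv
    ORDER BY A.prv
    """
).fetchall()

for row in join_rows:
    print(row)
```

The self-pair joins only with itself, fails the strict `>` comparison, and so disappears from the result, which is why this query is only safe when such rows do not exist.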
Based on your input data, you can do:
select i.*
from input i
where i.prv < i.cur;
More generally, if some pairs have no reversed duplicate:
select i.*
from input i
where i.prv < i.cur
union all
select i.*
from input i
where i.prv > i.cur and
not exists (select 1
from input i2
where i2.prv = i.cur and
i2.cur = i.prv and
i2.start_dt = i.start_dt
);
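As a quick check of the more general query, here is a sketch that runs it through Python's built-in sqlite3 (standing in for Spark SQL/Hive). The `Z, Y, 2018-12-31` row is a hypothetical addition with no reversed counterpart; a plain `prv < cur` filter would drop it, while the `union all` branch keeps it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input (prv TEXT, cur TEXT, start_dt TEXT)")
conn.executemany(
    "INSERT INTO input VALUES (?, ?, ?)",
    [
        ("A", "B", "2099-12-31"),
        ("B", "A", "2099-12-31"),
        ("P", "Q", "2018-12-31"),
        ("Q", "P", "2018-12-31"),
        ("Z", "Y", "2018-12-31"),  # hypothetical row with no reversed counterpart
    ],
)

# Simple filter: keeps only rows where prv sorts before cur.
simple_rows = sorted(
    conn.execute("SELECT prv, cur, start_dt FROM input WHERE prv < cur").fetchall()
)

# General query: also keeps prv > cur rows whose reverse is absent for that start_dt.
general_rows = sorted(
    conn.execute(
        """
        SELECT i.prv, i.cur, i.start_dt
        FROM input i
        WHERE i.prv < i.cur
        UNION ALL
        SELECT i.prv, i.cur, i.start_dt
        FROM input i
        WHERE i.prv > i.cur
          AND NOT EXISTS (SELECT 1
                          FROM input i2
                          WHERE i2.prv = i.cur
                            AND i2.cur = i.prv
                            AND i2.start_dt = i.start_dt)
        """
    ).fetchall()
)

print(simple_rows)
print(general_rows)
```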
Or, you can use row_number():
select i.*
from (select i.*,
row_number() over (partition by start_dt, least(prv, cur), greatest(prv, cur) order by start_dt) as seqnum
from input i
) i
where seqnum = 1;
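The row_number() approach can be tried the same way. This sketch again uses Python's built-in sqlite3 rather than Spark SQL/Hive, so it makes two substitutions that are assumptions of the example: SQLite's two-argument `MIN`/`MAX` stand in for `least`/`greatest`, and the window is ordered by `prv` instead of `start_dt` so that the surviving row is deterministic. It needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input (prv TEXT, cur TEXT, start_dt TEXT)")
conn.executemany(
    "INSERT INTO input VALUES (?, ?, ?)",
    [
        ("A", "B", "2099-12-31"),
        ("B", "A", "2099-12-31"),
        ("P", "Q", "2018-12-31"),
        ("Q", "P", "2018-12-31"),
        ("Z", "Y", "2018-12-31"),  # hypothetical row with no reversed counterpart
    ],
)

# Rows with the same start_dt and the same unordered (prv, cur) pair share a
# partition; only the first row of each partition is kept.
dedup_rows = sorted(
    conn.execute(
        """
        SELECT prv, cur, start_dt
        FROM (SELECT i.*,
                     ROW_NUMBER() OVER (
                         PARTITION BY start_dt, MIN(prv, cur), MAX(prv, cur)
                         ORDER BY prv
                     ) AS seqnum
              FROM input i) t
        WHERE seqnum = 1
        """
    ).fetchall()
)

print(dedup_rows)
```

Unlike the plain `prv < cur` filter, this keeps the unpaired `Z, Y` row, since it is the only member of its partition.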
This might be the most efficient method in Hive.