How to exclude a record that creates a cycle in data on the same start_dt?

To identify a cycle I can run select * from input A join input B on A.prv=B.cur and A.cur=B.prv, but how do I keep only one record from each cycle that has the same start_dt? Apart from prv and cur, all columns are the same for these records. I am using Spark SQL/Hive.

Input

prv  cur  start_dt
A     B   2099-12-31
B     A   2099-12-31
P     Q   2018-12-31
Q     P   2018-12-31

Output (any one record from each cycle)
prv  cur  start_dt
A     B   2099-12-31
P     Q   2018-12-31

If you don't have any record where prv = cur (such as A, A, 2099-12-31), then you could use:

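    -- Return only A's columns, and keep one side of each reversed pair on the same start_dt
    SELECT A.*
    FROM input A
    JOIN input B ON A.prv = B.cur AND A.cur = B.prv AND A.start_dt = B.start_dt
    WHERE A.prv > B.prv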

Based on your input data, you can do:

    select i.*
    from input i
    where i.prv < i.cur;

More generally, if not every pair has a reversed duplicate:

    select i.*
    from input i
    where i.prv < i.cur
    union all
    select i.*
    from input i
    where i.prv > i.cur and
          not exists (select 1
                      from input i2
                      where i2.prv = i.cur and
                            i2.cur = i.prv and
                            i2.start_dt = i.start_dt
                     );
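
As a quick sanity check, the union all query above can be run against the question's sample data. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL/Hive (an assumption, since the dialects differ elsewhere), and the Z, Y row is a hypothetical extra record added to show a pair that has no reversed duplicate.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL/Hive; this query uses only syntax common to both.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input (prv TEXT, cur TEXT, start_dt TEXT)")
conn.executemany(
    "INSERT INTO input VALUES (?, ?, ?)",
    [
        ("A", "B", "2099-12-31"),
        ("B", "A", "2099-12-31"),
        ("P", "Q", "2018-12-31"),
        ("Q", "P", "2018-12-31"),
        ("Z", "Y", "2020-01-01"),  # hypothetical row with no reversed partner
    ],
)

QUERY = """
select i.*
from input i
where i.prv < i.cur
union all
select i.*
from input i
where i.prv > i.cur and
      not exists (select 1
                  from input i2
                  where i2.prv = i.cur and
                        i2.cur = i.prv and
                        i2.start_dt = i.start_dt)
"""

# Sort only to make the printed order deterministic.
rows = sorted(conn.execute(QUERY).fetchall())
for row in rows:
    print(row)
```

With this data the query keeps A, B and P, Q from the two cycles, plus the unpaired Z, Y row, which the simpler prv < cur filter alone would have dropped.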

Or, you can use row_number():

    select i.*
    from (select i.*,
                 row_number() over (partition by start_dt, least(prv, cur), greatest(prv, cur) order by start_dt) as seqnum
          from input i
         ) i
    where seqnum = 1;

This might be the most efficient method in Hive.
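
The partitioning idea behind the row_number() query can be sketched outside SQL. The snippet below is an illustration in plain Python, not the Hive implementation: keep_one_per_cycle is a hypothetical helper name, and it keeps the first row seen in each group, whereas the SQL version may keep either row because the ordering inside a partition is arbitrary.

```python
def keep_one_per_cycle(rows):
    """Keep the first row seen for each (start_dt, unordered prv/cur pair) group."""
    seen = set()
    kept = []
    for prv, cur, start_dt in rows:
        # Same grouping key as the query: start_dt, least(prv, cur), greatest(prv, cur)
        key = (start_dt, min(prv, cur), max(prv, cur))
        if key not in seen:
            seen.add(key)
            kept.append((prv, cur, start_dt))
    return kept


# Sample data from the question
sample = [
    ("A", "B", "2099-12-31"),
    ("B", "A", "2099-12-31"),
    ("P", "Q", "2018-12-31"),
    ("Q", "P", "2018-12-31"),
]

result = keep_one_per_cycle(sample)
for row in result:
    print(row)
```

Because the key orders prv and cur before comparing, A, B and B, A on the same start_dt fall into one group, while the same pair on a different start_dt would form its own group and be kept.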

