I need help deduplicating a list of users (20 million+) who are identified by several different kinds of IDs.
Here's what it looks like:
- We have 3 kinds of user IDs: ID1, ID2 and ID3.
- At least 2 of them are always sent together: ID1 with ID2, or ID2 with ID3. ID3 is never sent with ID1.
- Users can have several ID1, ID2 or ID3.
- So my table will sometimes contain several rows with many different IDs that may all describe one single user.
All those IDs show one single user.
I'm thinking I could add a fourth ID (GroupID) that would be the one used to deduplicate them. A bit like this:
The problem is: I know how to do this in SQL Server with a cursor (CURSOR / OPEN / FETCH NEXT), but I only have HiveQL, Impala and Python available in my environment.
Would anyone know the best way to approach this?
Thanks a million times,
Hugo
Based on your example, and assuming id2 always exists, you can aggregate the rows grouped by id2:
select max(id1) id1, id2, max(id3) id3
from
( --your dataset as in example
  select 'A' as id1, 1 as id2, null as id3 union all
  select null as id1, 1 as id2, 'Alpha' as id3 union all
  select null as id1, 2 as id2, 'Beta' as id3 union all
  select 'A' as id1, 2 as id2, null as id3
) s
group by id2;
OK
A 1 Alpha
A 2 Beta
Time taken: 58.739 seconds, Fetched: 2 row(s)
And here is an attempt to implement the logic you described, in two passes:
select --pass2
  id1, id2, id3,
  case when lag(id2) over (order by id2, GroupId) = id2
       then lag(GroupId) over (order by id2, GroupId)
       else GroupId
  end GroupId2
from
(
  select --pass1
    id1, id2, id3,
    case when lag(id1) over (order by id1, NVL(ID1,ID3)) = id1
         then lag(NVL(ID1,ID3)) over (order by id1, NVL(ID1,ID3))
         else NVL(ID1,ID3)
    end GroupId
  from
  ( --your dataset as in example
    select 'A' as id1, 1 as id2, null as id3 union all
    select null as id1, 1 as id2, 'Alpha' as id3 union all
    select null as id1, 2 as id2, 'Beta' as id3 union all
    select 'A' as id1, 2 as id2, null as id3
  ) s
) s --pass1
;
OK
id1 id2 id3 groupid2
A 1 NULL A
NULL 1 Alpha A
A 2 NULL A
NULL 2 Beta A
Time taken: 106.944 seconds, Fetched: 4 row(s)
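Since Python is also available in your environment, the GroupID idea can be sketched there as a connected-components problem: every row links the IDs it carries, and all IDs reachable from each other get the same group. This is a different technique from the two lag passes above (a union-find structure), and I have not checked whether the lag passes cover longer chains of linked IDs. The function name `assign_group_ids`, the tuple layout `(id1, id2, id3)` and the use of `None` for a missing ID are my own assumptions for illustration; whether this fits in memory at 20 million+ rows depends on your environment.

```python
def assign_group_ids(rows):
    """Return one integer GroupID per row, in input order.

    rows: iterable of (id1, id2, id3) tuples, None marking a missing ID
    (hypothetical layout, chosen for this sketch).
    """
    parent = {}  # union-find forest: node -> parent node

    def find(x):
        # Walk up to the root, then compress the path (iterative, no recursion).
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    def keys_of(row):
        # Tag each value with its column number so that an ID1 and an ID3
        # holding the same value are not confused with each other.
        return [(col, val) for col, val in enumerate(row, start=1) if val is not None]

    rows = list(rows)
    for row in rows:
        keys = keys_of(row)
        for k in keys:
            parent.setdefault(k, k)
        for k in keys[1:]:
            union(keys[0], k)  # all IDs on one row belong to one user

    group_of_root = {}
    result = []
    for row in rows:
        keys = keys_of(row)
        if not keys:
            result.append(None)  # row without any ID: no group can be assigned
            continue
        root = find(keys[0])
        result.append(group_of_root.setdefault(root, len(group_of_root) + 1))
    return result


# The dataset from the example, plus one unrelated user for contrast.
sample = [
    ("A", 1, None),
    (None, 1, "Alpha"),
    (None, 2, "Beta"),
    ("A", 2, None),
    ("B", 3, None),
]
print(assign_group_ids(sample))  # [1, 1, 1, 1, 2]
```

The resulting list could then be written back next to the original rows and joined in Hive or Impala as the GroupID column.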