Deduplicating IDs with Hive QL / Impala / Python

I need help deduplicating a list of users (20 million+) across several different kinds of IDs.

Here's what it looks like:
- We have 3 kinds of user IDs: ID1, ID2 and ID3.
- At least 2 of them are always sent together: ID1 with ID2, or ID2 with ID3. ID3 is never sent with ID1.
- Users can have several ID1s, ID2s or ID3s.
- So sometimes, in my table, I will have several rows with lots of different IDs, and all of them may describe one single user.

An example:
[image]

All of those IDs belong to one single user.

I'm thinking I could add a fourth ID (GroupID) that would be the one deduplicating them. A bit like this:

[image]

The problem is: I know how to do this on SQL Server with a cursor (CURSOR / OPEN / FETCH NEXT), but I only have Hive QL, Impala and Python available in my environment.

Would anyone know the best way to approach this?

Thanks a million,

Hugo

Based on your example, and supposing id2 always exists, you can aggregate the rows, grouping by id2:

select max(id1) as id1, id2, max(id3) as id3 -- max() skips NULLs, so each id2 keeps its non-null id1 and id3
from
( -- your dataset as in the example
  select 'A'  as id1, 1 as id2, null    as id3 union all
  select null as id1, 1 as id2, 'Alpha' as id3 union all
  select null as id1, 2 as id2, 'Beta'  as id3 union all
  select 'A'  as id1, 2 as id2, null    as id3
) s
group by id2;

OK
A       1       Alpha
A       2       Beta
Time taken: 58.739 seconds, Fetched: 2 row(s)

And now I'm trying to implement the logic as you described it:

select -- pass 2: a row with the same id2 as the previous row takes that row's GroupId
  id1, id2, id3,
  case when lag(id2) over (order by id2, GroupId) = id2
       then lag(GroupId) over (order by id2, GroupId)
       else GroupId
  end GroupId2
from
(
  select -- pass 1: a row with the same id1 as the previous row takes that row's NVL(id1, id3)
    id1, id2, id3,
    case when lag(id1) over (order by id1, NVL(id1, id3)) = id1
         then lag(NVL(id1, id3)) over (order by id1, NVL(id1, id3))
         else NVL(id1, id3)
    end GroupId
  from
  ( -- your dataset as in the example
    select 'A'  as id1, 1 as id2, null    as id3 union all
    select null as id1, 1 as id2, 'Alpha' as id3 union all
    select null as id1, 2 as id2, 'Beta'  as id3 union all
    select 'A'  as id1, 2 as id2, null    as id3
  ) s
) s -- pass 1
;


OK
id1     id2     id3     groupid2
A       1       NULL    A
NULL    1       Alpha   A
A       2       NULL    A
NULL    2       Beta    A
Time taken: 106.944 seconds, Fetched: 4 row(s)
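
The two passes above propagate a group label one hop at a time, so longer chains of shared IDs might need more passes. As an alternative that is not part of the queries above, here is a rough Python sketch that treats every ID as a node, links the IDs that appear in the same row, and gives each connected component a GroupID using a union-find structure. The function name `assign_group_ids` and the numbering of groups (1, 2, ... in order of first appearance) are my own assumptions, and I have only run it on the sample rows, not on 20 million.

```python
KINDS = ("id1", "id2", "id3")

def assign_group_ids(rows):
    """Append a group id to each (id1, id2, id3) row; rows linked by any shared ID get the same one."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    def nodes_of(row):
        # Tag each value with its kind so that an ID1 'A' and an ID3 'A' stay distinct
        return [(kind, value) for kind, value in zip(KINDS, row) if value is not None]

    # Link all IDs that appear together in one row
    for row in rows:
        nodes = nodes_of(row)
        for node in nodes:
            find(node)
        for other in nodes[1:]:
            union(nodes[0], other)

    # Number the components in order of first appearance
    group_of_root = {}
    result = []
    for row in rows:
        nodes = nodes_of(row)
        if not nodes:
            result.append(row + (None,))  # a row without any ID cannot be grouped
            continue
        root = find(nodes[0])
        gid = group_of_root.setdefault(root, len(group_of_root) + 1)
        result.append(row + (gid,))
    return result

rows = [
    ("A", 1, None),
    (None, 1, "Alpha"),
    (None, 2, "Beta"),
    ("A", 2, None),
]
grouped = assign_group_ids(rows)
for row in grouped:
    print(row)
```

On the sample rows all four lines end up in the same group, which matches the single GroupId2 value in the Hive output above. For the full table this keeps one dictionary entry per distinct ID in memory on a single machine, so whether it is practical depends on how many distinct IDs you have.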
