Deduplicating IDs using Hive QL / Impala / Python

Question

I need help deduplicating a list of users (20 million+) across different sets of IDs.

Here's what it looks like:
- We have 3 kinds of user IDs: ID1, ID2 and ID3.
- At least 2 of them are always sent together: ID1 with ID2, or ID2 with ID3. ID3 is never sent with ID1.
- Users can have several ID1s, ID2s or ID3s.
- So sometimes my table will have several rows with lots of different IDs, and all of those may describe one single user.
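
A minimal Python sketch of the pairing rules in the list above, under one reading of them: every row carries ID2 plus exactly one of ID1 or ID3. The `Row` layout, the `row_is_valid` helper and the sample rows are hypothetical and only illustrate the data shape; the third row is deliberately invalid.

```python
from typing import List, Optional, Tuple

# Hypothetical row layout: (id1, id2, id3), with None for a missing ID.
Row = Tuple[Optional[str], Optional[str], Optional[str]]


def row_is_valid(row: Row) -> bool:
    """Check one row against the pairing rules as read above."""
    id1, id2, id3 = row
    if id2 is None:
        return False  # ID2 accompanies either ID1 or ID3
    if id1 is not None and id3 is not None:
        return False  # ID3 is never sent with ID1
    return id1 is not None or id3 is not None


# Made-up rows: two valid ones, then one pairing ID1 with ID3.
rows: List[Row] = [("A", "1", None), (None, "1", "Alpha"), ("A", None, "Alpha")]
validity = [row_is_valid(r) for r in rows]
print(validity)  # [True, True, False]
```

If this reading holds, ID2 is the only column guaranteed to be filled in every row.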

An example:

All those IDs describe one single user.

I'm thinking I could add a fourth ID (GroupID) that would be the one deduplicating them. A bit like this:
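
One way to picture such a GroupID is as a connected-components label: each row links the IDs it contains, and every ID reachable through those links belongs to the same user. The sketch below uses a union-find structure in plain Python. It is not taken from the thread. The `UnionFind` class, the `assign_group_ids` function and the sample rows (including the extra user `B`) are assumptions made for illustration, and it assumes every row holds at least one ID. Whether it fits 20 million rows in memory on one machine is untested here.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Row = Tuple[Optional[str], Optional[int], Optional[str]]  # (id1, id2, id3)
KINDS = ("id1", "id2", "id3")


class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}

    def find(self, x: Hashable) -> Hashable:
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walk at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: Hashable, b: Hashable) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def row_nodes(row: Row) -> List[Tuple[str, Hashable]]:
    # Tag each value with its column so ID1 'A' never collides with ID3 'A'.
    return [(kind, value) for kind, value in zip(KINDS, row) if value is not None]


def assign_group_ids(rows: List[Row]) -> List[int]:
    """Return one GroupID per row; rows sharing any ID chain get the same one."""
    uf = UnionFind()
    for row in rows:
        nodes = row_nodes(row)
        for other in nodes[1:]:
            uf.union(nodes[0], other)
    group_of_root: Dict[Hashable, int] = {}
    group_ids = []
    for row in rows:
        root = uf.find(row_nodes(row)[0])  # assumes the row has at least one ID
        group_ids.append(group_of_root.setdefault(root, len(group_of_root) + 1))
    return group_ids


# Made-up sample: the first four rows chain together, the last one is separate.
rows: List[Row] = [
    ("A", 1, None),
    (None, 1, "Alpha"),
    (None, 2, "Beta"),
    ("A", 2, None),
    ("B", 3, None),
]
group_ids = assign_group_ids(rows)
print(group_ids)  # [1, 1, 1, 1, 2]
```

The GroupID values themselves are arbitrary; only equality between rows carries meaning.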

The problem is: I know how to do this on SQL Server with the CURSOR / OPEN / FETCH / NEXT commands, but I only have Hive QL, Impala and Python available in my environment.

Would anyone know the best way to approach this?

Thanks a million,

Hugo

Answer 1

Based on your example, supposing id2 always exists, you can aggregate the rows, grouping by id2:

select max(id1) id1,  id2, max(id3) id3 from
( --your dataset as in example
 select 'A'  as id1, 1 as id2,  null   as id3 union all
 select null as id1, 1 as id2, 'Alpha' as id3 union all
 select null as id1, 2 as id2, 'Beta'  as id3 union all
 select 'A'  as id1, 2 as id2,  null   as id3
 )s
 group by id2;

OK
A       1       Alpha
A       2       Beta
Time taken: 58.739 seconds, Fetched: 2 row(s)
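
For readers following along in Python, the aggregation above can be mirrored roughly as below. This is only a sketch: `collapse_by_id2` and `max_ignoring_none` are hypothetical helpers, with `None` standing in for NULL, which SQL `max` skips. It reproduces the two result rows, and shows that both still carry the same id1 `A`, so grouping on id2 alone leaves them as separate rows.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[Optional[str], int, Optional[str]]  # (id1, id2, id3)


def max_ignoring_none(a: Optional[str], b: Optional[str]) -> Optional[str]:
    """Mimic SQL max(), which ignores NULLs."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


def collapse_by_id2(rows: List[Row]) -> List[Row]:
    """Rough equivalent of: select max(id1), id2, max(id3) ... group by id2."""
    merged: Dict[int, Tuple[Optional[str], Optional[str]]] = {}
    for id1, id2, id3 in rows:
        old1, old3 = merged.get(id2, (None, None))
        merged[id2] = (max_ignoring_none(old1, id1), max_ignoring_none(old3, id3))
    return [(id1, id2, id3) for id2, (id1, id3) in sorted(merged.items())]


# Same four rows as the inline dataset in the query above.
rows: List[Row] = [
    ("A", 1, None),
    (None, 1, "Alpha"),
    (None, 2, "Beta"),
    ("A", 2, None),
]
collapsed = collapse_by_id2(rows)
print(collapsed)  # [('A', 1, 'Alpha'), ('A', 2, 'Beta')]
```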

And now I'm trying to implement the logic as you described it:

select --pass2
 id1, id2, id3,
 case when lag(id2) over (order by id2, GroupId) = id2 then lag(GroupId) over (order by id2, GroupId) else GroupId end GroupId2
 from
 (
 select        --pass1
 id1, id2, id3,
 case when 
 lag(id1) over(order by id1, NVL(ID1,ID3)) =id1 then lag(NVL(ID1,ID3))  over(order by id1, NVL(ID1,ID3)) else NVL(ID1,ID3) end GroupId
 from
( --your dataset as in example
 select 'A'  as id1, 1 as id2,  null   as id3 union all
 select null as id1, 1 as id2, 'Alpha' as id3 union all
 select null as id1, 2 as id2, 'Beta'  as id3 union all
 select 'A'  as id1, 2 as id2,  null   as id3
 )s
 )s --pass1
;


OK
id1     id2     id3     groupid2
A       1       NULL    A
NULL    1       Alpha   A
A       2       NULL    A
NULL    2       Beta    A
Time taken: 106.944 seconds, Fetched: 4 row(s)
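
A fixed number of passes may not be enough when IDs chain through more hops than the passes cover. A hedged alternative, sketched here in Python rather than Hive QL, is to repeat the propagation until no label changes. In the sketch, each row starts with its own label, and every pass gives all rows that share a value in one column the smallest label among them. The function name `propagate_group_ids` and the sample rows are made up for illustration. Each inner step is meant to resemble something like `min(label) over (partition by idN)`, but that mapping is an assumption, not something tested on Hive or Impala.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Row = Tuple[Optional[str], Optional[int], Optional[str]]  # (id1, id2, id3)


def propagate_group_ids(rows: List[Row]) -> Tuple[List[int], int]:
    """Spread the smallest label across shared IDs until nothing changes.

    Returns (labels, number_of_passes).
    """
    labels = list(range(len(rows)))  # every row starts as its own group
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for col in range(3):
            # Smallest label currently attached to each value in this column.
            best: Dict[Hashable, int] = {}
            for row, label in zip(rows, labels):
                value = row[col]
                if value is not None:  # missing IDs never link rows
                    best[value] = min(label, best.get(value, label))
            for i, row in enumerate(rows):
                value = row[col]
                if value is not None and best[value] < labels[i]:
                    labels[i] = best[value]
                    changed = True
    return labels, passes


# Made-up sample: four chained rows plus one unrelated user.
rows: List[Row] = [
    ("A", 1, None),
    (None, 1, "Alpha"),
    (None, 2, "Beta"),
    ("A", 2, None),
    ("B", 3, None),
]
labels, passes = propagate_group_ids(rows)
print(labels)  # [0, 0, 0, 0, 4]
```

The last pass only confirms that nothing changed, which is what makes the loop terminate on its own instead of after a preset count.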

Answer 1
0 2018-03-21 11:20:42
