Unique Key generation in Hive/Hadoop

While selecting a set of records from a large Hive table, a unique key needs to be created for each record. In a sequential mode of operation, it is easy to generate a unique ID by calling something like max(id). Since Hive runs tasks in parallel, how can we generate a unique key as part of a SELECT query without compromising Hadoop's performance? Is this really a MapReduce problem, or do we need a sequential approach to solve it?

If for some reason you do not want to deal with UUIDs, this solution (based on numeric values) does not require your parallel units to "talk" to each other or synchronize in any way. It is therefore very efficient, but it does not guarantee that your integer keys will be contiguous.

If you have, say, N parallel units of execution, you know N, and each unit is assigned an ID from 0 to N - 1, then you can generate integers that are unique across all units:

Unit #0:   0, N, 2N, 3N, ...
Unit #1:   1, N+1, 2N+1, 3N+1, ...
...
Unit #N-1: N-1, N+(N-1), 2N+(N-1), 3N+(N-1), ...
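
The scheme above can be sketched in plain Java as follows. This is only an illustration: the class name InterleavedKeyGenerator and its methods are made up for this example and are not part of Hadoop or Hive. Each unit keeps a local counter and returns counter * N + unitId, so no two units can ever produce the same value.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: hands out unitId, unitId + N, unitId + 2N, ... with no coordination. */
class InterleavedKeyGenerator {
    private final long unitCount; // N: total number of parallel units
    private final long unitId;    // this unit's ID, in the range [0, N)
    private long counter = 0;     // number of keys this unit has issued so far

    InterleavedKeyGenerator(long unitId, long unitCount) {
        if (unitCount <= 0 || unitId < 0 || unitId >= unitCount) {
            throw new IllegalArgumentException("need 0 <= unitId < unitCount");
        }
        this.unitId = unitId;
        this.unitCount = unitCount;
    }

    /** Returns counter * N + unitId, then advances the counter. Throws on long overflow. */
    long nextKey() {
        return Math.addExact(Math.multiplyExact(counter++, unitCount), unitId);
    }

    public static void main(String[] args) {
        int n = 3; // pretend there are three parallel units
        Set<Long> seen = new HashSet<>();
        for (int unit = 0; unit < n; unit++) {
            InterleavedKeyGenerator gen = new InterleavedKeyGenerator(unit, n);
            for (int i = 0; i < 4; i++) {
                long key = gen.nextKey();
                // Set.add returns false if the key was already present
                if (!seen.add(key)) {
                    throw new IllegalStateException("duplicate key " + key);
                }
                System.out.print(key + " ");
            }
            System.out.println();
        }
    }
}
```

With N = 3, unit 0 yields 0, 3, 6, 9 and unit 2 yields 2, 5, 8, 11. In a real job, unitId and unitCount would come from the task ID and the job configuration rather than from a loop.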

Depending on where you need to generate keys (mapper or reducer), you can get N from the Hadoop configuration:

Mapper:  mapred.map.tasks
Reducer: mapred.reduce.tasks

... and the ID of your unit. In Java, it is:

 context.getTaskAttemptID().getTaskID().getId()

Not sure about Hive, but it should be possible there as well.

Use UUIDs instead of numbers. They work in a truly distributed way.

select reflect("java.util.UUID", "randomUUID")
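
For reference, the reflect call above invokes the static Java method java.util.UUID.randomUUID() for each row. A minimal standalone sketch of the same call is shown here; the class name RandomUuidKey and the method newKey are made up for illustration.

```java
import java.util.UUID;

/** Hypothetical wrapper around the same JDK call that the reflect() query invokes. */
class RandomUuidKey {
    /** Returns a random (version 4) UUID in its 36-character string form. */
    static String newKey() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Each call produces an independent value; no shared state between callers.
        System.out.println(newKey());
        System.out.println(newKey());
    }
}
```

Because every value is drawn independently, parallel tasks need no coordination, at the cost of keys that are strings rather than compact, ordered integers.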

SELECT T.*, ROW_NUMBER() OVER (ORDER BY T.C1) AS SEQ_NBR
FROM T

Here C1 is any numeric column in T. This generates a unique number for each record selected from table T, starting from 1. If this is a one-time activity, this solution is fine. Keep in mind that a window with ORDER BY and no PARTITION BY routes every row through a single reducer, so the numbering step itself runs sequentially.

If you need to repeat this process every day, insert the data into table T2, and keep generating unique IDs, you can try the method below.

-- Offset the new row numbers by the highest SEQ already in T2 (0 if T2 is empty).
SELECT T.*, ROW_NUMBER() OVER (ORDER BY T.C1) + M.SEQ_T2 AS SEQ_NBR
FROM T
CROSS JOIN (SELECT COALESCE(MAX(SEQ), 0) AS SEQ_T2 FROM T2) M
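
The arithmetic behind this incremental approach can be sketched in Java as below. The class OffsetSequence and its assign method are hypothetical and only simulate what the query computes: row i of a batch (counting from 1) receives maxExisting + i.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of ROW_NUMBER() + MAX(SEQ) across successive loads. */
class OffsetSequence {
    /** Assigns IDs maxExisting + 1 .. maxExisting + rowCount to a batch of rowCount rows. */
    static List<Long> assign(int rowCount, long maxExisting) {
        List<Long> ids = new ArrayList<>();
        for (int rowNumber = 1; rowNumber <= rowCount; rowNumber++) {
            ids.add(maxExisting + rowNumber);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Day 1: the target table is empty, so the maximum is treated as 0.
        List<Long> day1 = assign(3, 0);
        // Day 2: continue from the largest ID written on day 1.
        long maxAfterDay1 = day1.get(day1.size() - 1);
        List<Long> day2 = assign(2, maxAfterDay1);
        System.out.println(day1 + " " + day2); // prints [1, 2, 3] [4, 5]
    }
}
```

This stays collision-free only if loads do not overlap: two loads that read the same maximum at the same time would hand out the same IDs.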

Hope it helps!
