
Update delta records in Hive table

I have a table of history data that is more than a TB in size, and every day I receive delta (updated) records, GBs in size, which are stored in a delta table. Now I want to compare the delta records with the history records and update the history table with the latest data from the delta table.

What is the best approach to do this in Hive, given that I will be dealing with millions of rows? I searched the web and found the approach below.

http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive

But I don't think it would be the best approach in terms of performance.

In the latest Hive (0.14), you can do updates. You need to keep the table in ORC format and bucketed by the search key.

Oh, and I need to add this link for more information: Hive Transactions

In addition: do you have a good partitioning key, so that the updates only have to work on the latest partitions? If so, it can be good to do the following:

  1. Get the data from the required partitions into a temp table (T1).
  2. Let's say T2 is the new table with the updated records. It needs to be partitioned the same way as T1.
  3. Join T1 and T2 on the key(s) and take the rows that are present only in T1 and not in T2. Let's say this table is T3.
  4. Union T2 and T3 to create table T4.
  5. Drop the previously taken partitions from T1.
  6. Insert T4 into T1.

Remember, the operations may not be atomic: while steps 5 and 6 are running, any query on T1 can see intermediate results.
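
The six steps above can be sketched as follows. This is only an illustration of the logic, not Hive code: it uses Python's built-in sqlite3 so it can be run locally, and the table names (`history`, `delta`), the columns (`id`, `val`, `dt`) and the use of a plain `dt` column to stand in for a Hive partition are all assumptions made for the example. It also reads steps 5 and 6 as applying to the main history table, and assumes an updated record stays in the same partition as the record it replaces. The SQL itself is plain enough that it should carry over to HiveQL with minor changes, where step 5 would become dropping or overwriting the affected partitions.

```python
import sqlite3


def apply_delta(conn):
    """Rebuild the touched 'partitions' of history from delta plus unchanged rows."""
    cur = conn.cursor()
    # Step 1: copy only the partitions that the delta touches into T1.
    cur.execute("""
        CREATE TEMP TABLE t1 AS
        SELECT id, val, dt FROM history
        WHERE dt IN (SELECT DISTINCT dt FROM delta)
    """)
    # Steps 2-3: delta plays the role of T2; keep the rows of T1 whose key
    # does not appear in T2 (a left anti-join). The result is T3.
    cur.execute("""
        CREATE TEMP TABLE t3 AS
        SELECT t1.id, t1.val, t1.dt
        FROM t1 LEFT JOIN delta t2 ON t1.id = t2.id
        WHERE t2.id IS NULL
    """)
    # Step 4: T4 = new and updated rows (T2) plus untouched rows (T3).
    cur.execute("""
        CREATE TEMP TABLE t4 AS
        SELECT id, val, dt FROM delta
        UNION ALL
        SELECT id, val, dt FROM t3
    """)
    # Steps 5-6: remove the old partitions, then load T4 in their place.
    # SQLite runs both statements in one transaction; Hive without ACID
    # gives no such guarantee, so readers may see the gap between them.
    cur.execute("DELETE FROM history WHERE dt IN (SELECT DISTINCT dt FROM delta)")
    cur.execute("INSERT INTO history (id, val, dt) SELECT id, val, dt FROM t4")
    conn.commit()
    # Clean up the temporary tables.
    for name in ("t1", "t3", "t4"):
        cur.execute(f"DROP TABLE {name}")


def demo():
    """Run the merge on a tiny made-up data set and return the final rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE history (id INTEGER, val TEXT, dt TEXT)")
    conn.execute("CREATE TABLE delta (id INTEGER, val TEXT, dt TEXT)")
    conn.executemany("INSERT INTO history VALUES (?, ?, ?)", [
        (1, "old-a", "2015-01-01"),
        (2, "old-b", "2015-01-02"),
        (3, "old-c", "2015-01-02"),
    ])
    conn.executemany("INSERT INTO delta VALUES (?, ?, ?)", [
        (2, "new-b", "2015-01-02"),  # update of an existing key
        (4, "new-d", "2015-01-02"),  # brand-new key
    ])
    apply_delta(conn)
    return conn.execute("SELECT id, val, dt FROM history ORDER BY id").fetchall()
```

Calling `demo()` leaves key 1 untouched because its partition is not in the delta, replaces key 2, keeps key 3, and adds key 4. Only the partitions named in the delta are read and rewritten, which is the point of having a good partitioning key.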


 