
Merge delta data into an external table using hive's merge statement

I have an external table mapped in Hive (v2.3.2 on EMR-5.11.0) that I need to update with new data around once a week. The merge consists of a conditional upsert statement.

The table's location is in s3, and the data is always there (created once, and we just need to update it with new data).
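For context, a minimal sketch of what such an external table definition might look like; the table name, columns, format, and s3 path here are illustrative assumptions, not details from the question:

    -- Sketch only: table name, columns, format, and location are assumptions.
    create external table my_table (
      id int,
      name string,
      updateddate date
    )
    row format delimited fields terminated by ','
    stored as textfile
    location 's3://my-bucket/path/to/table/'
    tblproperties ("skip.header.line.count"="1");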

I've read this blog about merging data in Hive using the ACID feature on transactional tables (https://dzone.com/articles/update-hive-tables-the-easy-way-part-2-hortonworks), but as far as I can see, the only solution is to copy my external table into a temporary Hive internal table that is clustered and transactional; only on that table can I run the merge, and then I have to override my original data with the newly merged result.

This table is quite big (around 10GB of data), so I'd like to avoid copying it before every merge operation.

Is there a way to create the internal table and have it mapped to existing data? Or is there another way, besides the merge statement, to perform an upsert on Hive external tables?

Thanks a lot in advance!

If you want to use merge for an upsert, then you have to have the data in a managed (internal) Hive table. A simple solution, if you don't want to copy the file:

  1. Create an internal table with a similar structure:

    create table table1 (
      id int,
      Name string,
      UpdatedDate date
    )
    CLUSTERED BY (id) INTO 2 BUCKETS
    STORED AS ORC
    tblproperties ("transactional"="true", "skip.header.line.count"="1");
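MERGE requires the target table to be bucketed, stored as ORC, and transactional, which is why the DDL above sets those options. As an optional sanity check (a suggestion, not part of the original answer), you can confirm the properties were applied:

    -- Optional check, not from the original answer: verify transactional=true.
    show tblproperties table1;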

  2. Insert data into the internal table from the external table:

    INSERT INTO TABLE table1 SELECT * FROM <your_external_table>;

  3. Run your merge statement to update the internal table:

    set hive.support.concurrency = true;
    set hive.enforce.bucketing = true;
    set hive.exec.dynamic.partition.mode = nonstrict;
    set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    set hive.compactor.initiator.on = true;
    set hive.compactor.worker.threads = 1;

    merge into table1
    using DailyFeed
    on DailyFeed.id = table1.id
    when matched and table1.name <> DailyFeed.name
      then update set name = DailyFeed.name
    when not matched
      then insert values (DailyFeed.id, DailyFeed.name, CURRENT_DATE);
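Here DailyFeed is the source of the new delta rows. The answer doesn't define it; it could be, for example, an external table over the weekly feed files. A minimal sketch, where the columns match the merge above but the format and location are hypothetical assumptions:

    -- Illustrative assumption: DailyFeed as an external table over the delta files.
    create external table DailyFeed (
      id int,
      name string
    )
    row format delimited fields terminated by ','
    stored as textfile
    location 's3://my-bucket/daily_feed/';  -- hypothetical path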

  4. Export the internal table to overwrite your base file:

hive -e 'set hive.cli.print.header=true; select * from table1' | sed -E 's/[[:space:]]+/,/g' > /home/user1/table1.csv

Or

insert overwrite local directory '/home/user1/table1.csv' row format delimited fields terminated by ',' SELECT * FROM table1;
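Since the base data lives in s3, a possible variation (an assumption about the setup, not something stated in the answer) is to write the merged result back to an s3 path directly and skip the local file:

    -- Sketch, assuming the cluster can write to the table's s3 location (path is an assumption).
    insert overwrite directory 's3://my-bucket/path/to/table/'
    row format delimited fields terminated by ','
    select * from table1;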

Hope this will help to resolve your issue.

Thanks

Nilesh
