Hive - How to track and update Last Modified date in Hive for delta tables?

I have a use case where the source table in Hive is refreshed in full every day. On day one, we ingest the entire table, but from day two onwards, we are only interested in those rows whose "Last Modified Date" has been updated to reflect the previous day's date.

The proposed solution is to store the MAX of the Last Modified Date on day 1 and, on day 2, select the rows whose Last Modified Date is greater than the stored date and process only those rows.

What is the best way of generating, storing and retrieving this Last Modified Date on a daily basis? Also, different tables will have different dates, so ideally I'd like something that holds Table_Name, Last_Modified_Date, unless there is a better way of doing it.

Please help. Thank you.

If I understood your scenario correctly, at each new daily run the value of Last_Modified_Date can only be greater than the maximum Last_Modified_Date from the previous run.

In that case I would suggest partitioning the table on Last_Modified_Date and processing only the records that fall into the new partition (which is much faster than running your comparison over the whole table).

Is this solution possible?

  1. Extract the date from "Last Modified Date" as a new column named dateid, and use dateid as the partition key.
  2. When you refresh the entire data set, split the data into partitions by dateid (Hive's dynamic partition feature can do this).
  3. Then, if you process data day by day, you only need to process the data in the latest dateid.
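
The three steps above could be sketched roughly as follows. This is only an illustration: the column names (id, payload, last_modified_date), the function names, and the assumption that the target table was already created with PARTITIONED BY (dateid STRING) are all made up for the example and do not come from the question.

```shell
#!/usr/bin/env bash
# Sketch only: helpers that print HiveQL for a dateid-partitioned reload.
# Every table, column and function name here is a hypothetical placeholder.

# Print the statements that reload DST from SRC, letting Hive's dynamic
# partitioning route each row to the partition named by its dateid.
# Assumes DST already exists and is PARTITIONED BY (dateid STRING).
build_partition_load_sql() {
  local src="$1" dst="$2"
  cat <<SQL
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE ${dst} PARTITION (dateid)
SELECT id, payload, last_modified_date,
       to_date(last_modified_date) AS dateid
FROM ${src};
SQL
}

# Print a query that reads a single dateid partition of DST.
build_latest_partition_sql() {
  local dst="$1" dateid="$2"
  printf "SELECT * FROM %s WHERE dateid = '%s';\n" "$dst" "$dateid"
}

# Example use (requires the Hive CLI on PATH), left commented out:
# hive -e "$(build_partition_load_sql source_table source_table_part)"
# hive -e "$(build_latest_partition_sql source_table_part 2024-01-02)"
```

The dynamic partition column has to come last in the SELECT list, which is why dateid is computed at the end. Filtering on dateid lets Hive read only that partition instead of scanning the full table.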

After much brainstorming, we decided to settle on using an intermediate table to store the MAX of the Last Modified Date along with the table name, and to use that as a lookup to determine the new records to be processed. Since we are using shell scripts, it occurred to me that I can use a variable to query the table and get the Last Modified Date, and then use that variable to process the new/updated records.
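
A minimal sketch of that lookup-table approach in a shell script might look like the following. The watermark table name (etl_watermark), its layout (a last_modified_date column, partitioned by table_name so that one table's entry can be overwritten without touching the others), and all function names are assumptions made for illustration, not details from our actual setup.

```shell
#!/usr/bin/env bash
# Sketch only: watermark lookup driven from a shell variable.
# Assumed (hypothetical) layout:
#   etl_watermark (last_modified_date TIMESTAMP) PARTITIONED BY (table_name STRING)

WATERMARK_TABLE="${WATERMARK_TABLE:-etl_watermark}"

# Print the query that reads the stored mark for one source table.
watermark_lookup_sql() {
  printf "SELECT last_modified_date FROM %s WHERE table_name = '%s';\n" \
    "$WATERMARK_TABLE" "$1"
}

# Print the query that selects rows modified after the stored mark.
delta_rows_sql() {
  printf "SELECT * FROM %s WHERE last_modified_date > '%s';\n" "$1" "$2"
}

# Print the statement that replaces the stored mark with the current MAX.
watermark_update_sql() {
  printf "INSERT OVERWRITE TABLE %s PARTITION (table_name='%s') SELECT MAX(last_modified_date) FROM %s;\n" \
    "$WATERMARK_TABLE" "$1" "$1"
}

# Daily run for one source table (requires the Hive CLI on PATH).
run_delta() {
  local src="$1" last
  # -S keeps the Hive CLI quiet so only the value is captured.
  last="$(hive -S -e "$(watermark_lookup_sql "$src")")"
  # On the very first run nothing is stored yet, so take everything.
  last="${last:-1970-01-01 00:00:00}"
  hive -e "$(delta_rows_sql "$src" "$last")"
  hive -e "$(watermark_update_sql "$src")"
}
```

Calling run_delta source_table once per day would read the stored date into a variable, select only the newer rows, and then move the mark forward. In a real job the delta query would feed an INSERT into the target table rather than print the rows.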

describe formatted table_name ... The output includes transient_lastDdlTime, which you can convert with the following.

SELECT CAST(from_unixtime(your_transient_lastDdlTime_value) AS timestamp);
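
If you want to do the same from a shell script, something like the sketch below could pull the value out of the describe formatted output and convert it. The function names are hypothetical, the parsing assumes the whitespace-separated layout printed by the Hive CLI (Beeline's boxed output would need different parsing), and the conversion assumes GNU date. Note that, as far as I know, transient_lastDdlTime reflects the last table-level metadata change, not when individual rows were modified.

```shell
#!/usr/bin/env bash
# Sketch only: read transient_lastDdlTime from `describe formatted` output.
# Function names are hypothetical; parsing assumes Hive CLI text output.

# Read describe-formatted text on stdin and print the epoch seconds.
extract_last_ddl_time() {
  awk '$1 == "transient_lastDdlTime" { print $2; exit }'
}

# Convert epoch seconds to a UTC timestamp string (GNU date assumed).
epoch_to_timestamp() {
  date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# Example use (requires the Hive CLI on PATH), left commented out:
# epoch="$(hive -S -e 'describe formatted table_name' | extract_last_ddl_time)"
# epoch_to_timestamp "$epoch"
```

The shell conversion prints UTC, whereas from_unixtime in Hive formats the value in the session's time zone, so the two can differ by the zone offset.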

Thanks & Regards, Kamleshkumar Gujarathi
