Hive - How to track and update Last Modified date in Hive for delta tables?

I have a use case where the source table in Hive is updated daily in such a way that the entire data is refreshed. On day one, we ingest the entire table, but from day two onwards, we are only interested in those rows whose "Last Modified Date" has been updated to reflect the previous day's date.

The proposed solution is to store the MAX of the Last Modified Date on day 1 and, on day 2, select all rows whose Last Modified Date is greater than the stored date and process only those rows.

What is the best way of generating, storing and retrieving this Last Modified Date on a daily basis? Also, different tables will have different dates, so ideally I'd like something that stores Table_Name and Last_Modified_Date, unless there is a better way of doing it.

Please help. Thank you.

If I understood your scenario correctly, on each new daily run the value of Last_Modified_Date for changed rows can only be greater than the maximum Last_Modified_Date seen in the previous run.

In that case I would suggest partitioning the table on Last_Modified_Date and processing only the records that fall into the latest partition, which is much faster than comparing every row against the stored date.

Is this solution possible?

  1. Extract the date from "Last Modified Date" into a new column named dateid, and use dateid as the partition key.
  2. When the entire data is refreshed, split it across the partitions (Hive's dynamic partition feature can do this).
  3. If you process the data day by day, you then only need to process the latest dateid partition.
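
The three steps above could be sketched roughly as follows. This is only an illustration: source_table, source_table_part and last_modified_date are made-up names, and it assumes source_table_part already exists with the same columns as the source plus a dateid partition column. The functions just print HiveQL, so you can inspect it before handing it to hive -e.

```shell
#!/bin/sh
# Sketch only: all table and column names passed in are hypothetical.

# Print HiveQL that reloads the partitioned copy using dynamic partitioning.
# $1 = source table, $2 = partitioned target table, $3 = last-modified column
build_repartition_sql() {
    src="$1"; dst="$2"; col="$3"
    cat <<EOF
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE ${dst} PARTITION (dateid)
SELECT s.*, to_date(s.${col}) AS dateid
FROM ${src} s;
EOF
}

# Print HiveQL that reads a single day's partition.
# $1 = partitioned table, $2 = day in yyyy-MM-dd form
build_partition_read_sql() {
    printf "SELECT * FROM %s WHERE dateid = '%s';\n" "$1" "$2"
}

# Example wiring (needs the hive CLI, so it is left commented out):
# hive -e "$(build_repartition_sql source_table source_table_part last_modified_date)"
# hive -e "$(build_partition_read_sql source_table_part 2024-01-01)"
```

Because dateid is the partition key, the second query should only touch the files for that one day instead of scanning the whole table.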

After much brainstorming, we decided to use an intermediate table that stores the MAX of the Last Modified Date together with the table name, and to use it as a lookup to determine which new records need processing. Since we are using shell scripts, it occurred to me that I can query that table into a shell variable to get the Last Modified Date and then use the variable to process the new/updated records.
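
For anyone wanting to try the same approach, here is a minimal sketch of how the lookup could be wired up in a shell script. The lookup table name etl_watermark, its columns (table_name, last_modified_date) and source_table are my own placeholders, not something from our actual setup, and dates are assumed to be stored as sortable strings.

```shell
#!/bin/sh
# Sketch only. Assumes a hypothetical lookup table:
#   etl_watermark(table_name STRING, last_modified_date STRING)

# Print HiveQL that fetches the stored watermark for one table.
watermark_query() {
    printf "SELECT MAX(last_modified_date) FROM etl_watermark WHERE table_name = '%s';\n" "$1"
}

# Print HiveQL that selects only rows newer than the watermark.
# $1 = source table, $2 = watermark value
delta_query() {
    printf "SELECT * FROM %s WHERE last_modified_date > '%s';\n" "$1" "$2"
}

# Print HiveQL that appends the new watermark after a successful run.
record_watermark_query() {
    printf "INSERT INTO TABLE etl_watermark SELECT '%s', MAX(last_modified_date) FROM %s;\n" "$1" "$1"
}

# Example wiring (needs the hive CLI, so it is left commented out):
# last=$(hive -S -e "$(watermark_query source_table)")
# [ "$last" = "NULL" ] && last='1970-01-01'   # first run: no watermark stored yet
# hive -e "$(delta_query source_table "$last")"
# hive -e "$(record_watermark_query source_table)"
```

Appending a new row per run and reading back the MAX avoids needing UPDATE support on the lookup table, at the cost of keeping a small history of watermarks.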

describe formatted table_name ... You will get transient_lastDdlTime, which you can convert with the following:

SELECT CAST(from_unixtime(your_transient_lastDdlTime_value) AS timestamp);
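
If you want to script this, something like the sketch below might work. It assumes the describe formatted output contains a line whose first whitespace-separated field is transient_lastDdlTime followed by the epoch value, so check that against your Hive version. Also keep in mind that, as far as I know, transient_lastDdlTime is table-level metadata about when the table was last altered, not a per-row Last Modified Date.

```shell
#!/bin/sh
# Sketch only: relies on an assumed layout of the DESCRIBE FORMATTED output.

# Read DESCRIBE FORMATTED output on stdin and print the epoch value
# that follows the transient_lastDdlTime key.
extract_last_ddl_time() {
    awk '$1 == "transient_lastDdlTime" { print $2; exit }'
}

# Print the conversion query from the answer for a given epoch value.
ddl_time_query() {
    printf "SELECT CAST(from_unixtime(%s) AS timestamp);\n" "$1"
}

# Example wiring (needs the hive CLI, so it is left commented out):
# epoch=$(hive -S -e "DESCRIBE FORMATTED table_name" | extract_last_ddl_time)
# hive -S -e "$(ddl_time_query "$epoch")"
```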

Thanks & Regards, Kamleshkumar Gujarathi
