
Best practice of ETL using Spring Batch?

I am using Spring Batch to extract, transform, and load massive amounts of online data into a data warehouse for recommendation analysis. Both the online database and the warehouse are RDBMSs.

My question is: what's the best practice for offline Spring Batch ETL, full load or incremental load? I prefer full load because it's simpler. Currently I'm using these steps for the data-loading job (a simplified sketch of the configuration follows the list):

step1: truncate table A in data warehouse;
step2: load data into table A;
step3: truncate table B in data warehouse;
step4: load data into table B;
step5: truncate table C in data warehouse;
step6: load data into table C;
...
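For reference, here is roughly what that configuration looks like (Spring Batch 4 style; the table names, chunk size, and reader/writer beans are placeholders, not my real code):

    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
    import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.repeat.RepeatStatus;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.jdbc.core.JdbcTemplate;

    @Configuration
    @EnableBatchProcessing
    public class WarehouseLoadJobConfig {

        @Autowired private JobBuilderFactory jobs;
        @Autowired private StepBuilderFactory steps;
        @Autowired private DataSource warehouseDataSource;

        // Tasklet step that truncates one warehouse table.
        private Step truncateStep(String table) {
            return steps.get("truncate-" + table)
                    .tasklet((contribution, chunkContext) -> {
                        new JdbcTemplate(warehouseDataSource)
                                .execute("TRUNCATE TABLE " + table);
                        return RepeatStatus.FINISHED;
                    })
                    .build();
        }

        // Chunk step that copies rows from the online db into one warehouse table.
        private Step loadStep(String table,
                              ItemReader<Map<String, Object>> reader,
                              ItemWriter<Map<String, Object>> writer) {
            return steps.get("load-" + table)
                    .<Map<String, Object>, Map<String, Object>>chunk(1000)
                    .reader(reader)
                    .writer(writer)
                    .build();
        }

        @Bean
        public Job fullLoadJob(ItemReader<Map<String, Object>> readerA,
                               ItemWriter<Map<String, Object>> writerA) {
            return jobs.get("fullLoadJob")
                    .start(truncateStep("TABLE_A"))
                    .next(loadStep("TABLE_A", readerA, writerA))
                    // ... repeat the truncate/load pair for TABLE_B, TABLE_C, ...
                    .build();
        }
    }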

Tables A, B, C, ... in the data warehouse are used by the real-time recommendation system.

But since the data I load from the online db is massive, the entire job takes a long time to run. So if I have truncated a table but haven't finished loading data into it yet, any real-time recommendation processing that relies on that table will have a big problem. How can I prevent this window of incomplete data? By using a staging table, or some strategy like that?
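The staging-table idea I have in mind would look something like this: load into a shadow table instead of truncating the live one, then swap the shadow table in with a rename so readers never see an empty or half-loaded table. A rough sketch (TABLE_A_STAGE is hypothetical, and the multi-table RENAME TABLE syntax is MySQL-specific; other engines differ):

    import javax.sql.DataSource;

    import org.springframework.batch.core.StepContribution;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.batch.core.step.tasklet.Tasklet;
    import org.springframework.batch.repeat.RepeatStatus;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class SwapTableTasklet implements Tasklet {

        private final JdbcTemplate jdbc;

        public SwapTableTasklet(DataSource warehouseDataSource) {
            this.jdbc = new JdbcTemplate(warehouseDataSource);
        }

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
            // One atomic statement: readers see either the old table or the
            // fully loaded new one, never a half-loaded table.
            jdbc.execute("RENAME TABLE TABLE_A TO TABLE_A_OLD, TABLE_A_STAGE TO TABLE_A");
            jdbc.execute("DROP TABLE TABLE_A_OLD");
            return RepeatStatus.FINISHED;
        }
    }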

Any reply will be greatly appreciated.

You have a couple of options:

  • Use an audit log on the source tables to determine which records need to be updated in the target. This is the best option for batch ETL, but it requires audit logging to be turned on in the source system. If you can turn auditing on and it won't cause a performance problem, that's the way to go (see the audit-log reader sketch after this list).

  • If there are no deletes in the source table (only inserts and updates), you can simply do a full read/write from source to target, in chunks of records.

    Depending on the target database engine, you will have different options for doing the updates. Some engines require you to attempt one kind of write (either an insert or an update) and, if it fails, catch the exception and perform the other. For example, attempt the insert; if you catch a DuplicateKeyException, do an update instead. Depending on the ratio of inserts to updates, you can reverse the order from insert/update to update/insert. (A writer sketch for this pattern follows the list.)

    Other engines support MERGE, which performs the update/insert/delete in one statement (see the MERGE writer sketch after the list).

    This approach still moves a lot of data, but it has the minimum impact on the target, since you write to the target as you read. This assumes, of course, that you can order your table updates so that you don't run into referential integrity problems.
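Here is a rough sketch of a reader driven by an audit log, per the first option. The AUDIT_LOG table and its columns are assumptions; your source system's audit schema will differ:

    import java.sql.Timestamp;
    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
    import org.springframework.jdbc.core.ColumnMapRowMapper;

    public class AuditLogReaders {

        // Reads only the TABLE_A rows whose ids appear in the audit log since
        // the last successful run, instead of re-reading the whole table.
        public static JdbcCursorItemReader<Map<String, Object>> changedRows(
                DataSource onlineDb, Timestamp lastSuccessfulRun) {
            return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
                    .name("changedRowsReader")
                    .dataSource(onlineDb)
                    .sql("SELECT a.* FROM TABLE_A a "
                       + "JOIN AUDIT_LOG l ON l.row_id = a.id "
                       + "WHERE l.table_name = 'TABLE_A' AND l.changed_at > ?")
                    .preparedStatementSetter(ps -> ps.setTimestamp(1, lastSuccessfulRun))
                    .rowMapper(new ColumnMapRowMapper())
                    .build();
        }
    }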
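For the insert-first/update-on-conflict pattern, a sketch using the Spring Batch 4 ItemWriter signature (Batch 5 passes a Chunk instead of a List; table and column names are illustrative):

    import java.util.List;
    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.item.ItemWriter;
    import org.springframework.dao.DuplicateKeyException;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class InsertThenUpdateWriter implements ItemWriter<Map<String, Object>> {

        private final JdbcTemplate jdbc;

        public InsertThenUpdateWriter(DataSource warehouseDataSource) {
            this.jdbc = new JdbcTemplate(warehouseDataSource);
        }

        @Override
        public void write(List<? extends Map<String, Object>> items) {
            for (Map<String, Object> row : items) {
                try {
                    // Insert first; a unique-key violation means the row exists.
                    jdbc.update("INSERT INTO TABLE_A (id, payload) VALUES (?, ?)",
                            row.get("id"), row.get("payload"));
                } catch (DuplicateKeyException e) {
                    // Row already there, so update it instead. If updates
                    // outnumber inserts, reverse the order: update first and
                    // insert only when zero rows were affected.
                    jdbc.update("UPDATE TABLE_A SET payload = ? WHERE id = ?",
                            row.get("payload"), row.get("id"));
                }
            }
        }
    }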
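And for engines that support MERGE, a sketch using Oracle-style syntax (again, names are illustrative, and the exact MERGE dialect depends on your target engine):

    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.item.database.JdbcBatchItemWriter;
    import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;

    public class MergeWriters {

        // Lets the database decide between insert and update in one statement.
        public static JdbcBatchItemWriter<Map<String, Object>> mergeWriter(
                DataSource warehouseDataSource) {
            return new JdbcBatchItemWriterBuilder<Map<String, Object>>()
                    .dataSource(warehouseDataSource)
                    .sql("MERGE INTO TABLE_A t "
                       + "USING (SELECT :id AS id, :payload AS payload FROM dual) s "
                       + "ON (t.id = s.id) "
                       + "WHEN MATCHED THEN UPDATE SET t.payload = s.payload "
                       + "WHEN NOT MATCHED THEN INSERT (id, payload) "
                       + "VALUES (s.id, s.payload)")
                    .columnMapped() // bind :id and :payload from the item Map
                    .build();
        }
    }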
