
Best practice of ETL using Spring Batch?

I am using Spring Batch to extract, transform, and load massive amounts of online data into a data warehouse for recommendation analysis. Both the online database and the warehouse are RDBMSs.

My question is: what's the best practice for offline Spring Batch ETL, full load or incremental load? I prefer full load because it's simpler. Currently I'm using these steps for the data-loading job (a simplified sketch of the configuration follows the list):

step1: truncate table A in data warehouse;
step2: load data into table A;
step3: truncate table B in data warehouse;
step4: load data into table B;
step5: truncate table C in data warehouse;
step6: load data into table C;
...
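For reference, here is roughly what that configuration looks like (Spring Batch 4 style; the table names, chunk size, and reader/writer beans are placeholders, not my real code):

    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
    import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.repeat.RepeatStatus;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.jdbc.core.JdbcTemplate;

    @Configuration
    @EnableBatchProcessing
    public class WarehouseLoadJobConfig {

        @Autowired private JobBuilderFactory jobs;
        @Autowired private StepBuilderFactory steps;
        @Autowired private DataSource warehouseDataSource;

        // Tasklet step that truncates one warehouse table.
        private Step truncateStep(String table) {
            return steps.get("truncate-" + table)
                    .tasklet((contribution, chunkContext) -> {
                        new JdbcTemplate(warehouseDataSource)
                                .execute("TRUNCATE TABLE " + table);
                        return RepeatStatus.FINISHED;
                    })
                    .build();
        }

        // Chunk step that copies rows from the online db into one warehouse table.
        private Step loadStep(String table,
                              ItemReader<Map<String, Object>> reader,
                              ItemWriter<Map<String, Object>> writer) {
            return steps.get("load-" + table)
                    .<Map<String, Object>, Map<String, Object>>chunk(1000)
                    .reader(reader)
                    .writer(writer)
                    .build();
        }

        @Bean
        public Job fullLoadJob(ItemReader<Map<String, Object>> readerA,
                               ItemWriter<Map<String, Object>> writerA) {
            return jobs.get("fullLoadJob")
                    .start(truncateStep("TABLE_A"))
                    .next(loadStep("TABLE_A", readerA, writerA))
                    // ... repeat the truncate/load pair for TABLE_B, TABLE_C, ...
                    .build();
        }
    }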

Tables A, B, C, ... in the data warehouse are used by the real-time recommendation system.

But since the data I load from the online db is massive, the entire job takes a long time to run. So if I have truncated a table but haven't finished loading data into it yet, any real-time recommendation processing that relies on that table will have a big problem. How can I prevent this window of incomplete data? By using a staging table, or some strategy like that?
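The staging-table idea I have in mind would look something like this: load into a shadow table instead of truncating the live one, then swap the shadow table in with a rename so readers never see an empty or half-loaded table. A rough sketch (TABLE_A_STAGE is hypothetical, and the multi-table RENAME TABLE syntax is MySQL-specific; other engines differ):

    import javax.sql.DataSource;

    import org.springframework.batch.core.StepContribution;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.batch.core.step.tasklet.Tasklet;
    import org.springframework.batch.repeat.RepeatStatus;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class SwapTableTasklet implements Tasklet {

        private final JdbcTemplate jdbc;

        public SwapTableTasklet(DataSource warehouseDataSource) {
            this.jdbc = new JdbcTemplate(warehouseDataSource);
        }

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
            // One atomic statement: readers see either the old table or the
            // fully loaded new one, never a half-loaded table.
            jdbc.execute("RENAME TABLE TABLE_A TO TABLE_A_OLD, TABLE_A_STAGE TO TABLE_A");
            jdbc.execute("DROP TABLE TABLE_A_OLD");
            return RepeatStatus.FINISHED;
        }
    }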

Any reply will be greatly appreciated.

You have a couple of options:

  • Use an audit log on the source tables to determine which records need to be updated in the target. This is the best option for batch ETL, but it requires audit logging to be turned on in the source system. If you can turn auditing on and it won't cause a performance problem, that's the way to go (see the audit-log reader sketch after this list).

  • If there are no deletes in the source table (only inserts and updates), you can simply do a full read/write from source to target, in chunks of records.

    Depending on the target database engine, you will have different options for doing the updates. Some engines require you to attempt one kind of write (either an insert or an update) and, if it fails, catch the exception and perform the other. For example, attempt the insert; if you catch a DuplicateKeyException, do an update instead. Depending on the ratio of inserts to updates, you can reverse the order from insert/update to update/insert. (A writer sketch for this pattern follows the list.)

    Other engines support MERGE, which performs the update/insert/delete in one statement (see the MERGE writer sketch after the list).

    This approach still moves a lot of data, but it has the minimum impact on the target, since you write to the target as you read. This assumes, of course, that you can order your table updates so that you don't run into referential integrity problems.
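Here is a rough sketch of a reader driven by an audit log, per the first option. The AUDIT_LOG table and its columns are assumptions; your source system's audit schema will differ:

    import java.sql.Timestamp;
    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
    import org.springframework.jdbc.core.ColumnMapRowMapper;

    public class AuditLogReaders {

        // Reads only the TABLE_A rows whose ids appear in the audit log since
        // the last successful run, instead of re-reading the whole table.
        public static JdbcCursorItemReader<Map<String, Object>> changedRows(
                DataSource onlineDb, Timestamp lastSuccessfulRun) {
            return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
                    .name("changedRowsReader")
                    .dataSource(onlineDb)
                    .sql("SELECT a.* FROM TABLE_A a "
                       + "JOIN AUDIT_LOG l ON l.row_id = a.id "
                       + "WHERE l.table_name = 'TABLE_A' AND l.changed_at > ?")
                    .preparedStatementSetter(ps -> ps.setTimestamp(1, lastSuccessfulRun))
                    .rowMapper(new ColumnMapRowMapper())
                    .build();
        }
    }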
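For the insert-first/update-on-conflict pattern, a sketch using the Spring Batch 4 ItemWriter signature (Batch 5 passes a Chunk instead of a List; table and column names are illustrative):

    import java.util.List;
    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.item.ItemWriter;
    import org.springframework.dao.DuplicateKeyException;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class InsertThenUpdateWriter implements ItemWriter<Map<String, Object>> {

        private final JdbcTemplate jdbc;

        public InsertThenUpdateWriter(DataSource warehouseDataSource) {
            this.jdbc = new JdbcTemplate(warehouseDataSource);
        }

        @Override
        public void write(List<? extends Map<String, Object>> items) {
            for (Map<String, Object> row : items) {
                try {
                    // Insert first; a unique-key violation means the row exists.
                    jdbc.update("INSERT INTO TABLE_A (id, payload) VALUES (?, ?)",
                            row.get("id"), row.get("payload"));
                } catch (DuplicateKeyException e) {
                    // Row already there, so update it instead. If updates
                    // outnumber inserts, reverse the order: update first and
                    // insert only when zero rows were affected.
                    jdbc.update("UPDATE TABLE_A SET payload = ? WHERE id = ?",
                            row.get("payload"), row.get("id"));
                }
            }
        }
    }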
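And for engines that support MERGE, a sketch using Oracle-style syntax (again, names are illustrative, and the exact MERGE dialect depends on your target engine):

    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.item.database.JdbcBatchItemWriter;
    import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;

    public class MergeWriters {

        // Lets the database decide between insert and update in one statement.
        public static JdbcBatchItemWriter<Map<String, Object>> mergeWriter(
                DataSource warehouseDataSource) {
            return new JdbcBatchItemWriterBuilder<Map<String, Object>>()
                    .dataSource(warehouseDataSource)
                    .sql("MERGE INTO TABLE_A t "
                       + "USING (SELECT :id AS id, :payload AS payload FROM dual) s "
                       + "ON (t.id = s.id) "
                       + "WHEN MATCHED THEN UPDATE SET t.payload = s.payload "
                       + "WHEN NOT MATCHED THEN INSERT (id, payload) "
                       + "VALUES (s.id, s.payload)")
                    .columnMapped() // bind :id and :payload from the item Map
                    .build();
        }
    }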
