
ETL process to transfer data from one DB to another using Apache Spark

I need to create an ETL process that will extract, transform, and then load 100+ tables from several instances of SQL Server to as many instances of Oracle, in parallel, on a daily basis. I understand that I can create multiple threads in Java to accomplish this, but if all of them run on the same machine this approach won't scale. Another approach could be to get a bunch of EC2 instances and start transferring the tables for each database instance on a different EC2 instance. With this approach, though, I would have to take care of "elasticity" myself by adding/removing machines from my pool.

Somehow I think I can use "Apache Spark on Amazon EMR" to accomplish this, but in the past I've used Spark only to handle data on HDFS/Hive, so I'm not sure whether transferring data from one DB to another DB is a good use case for Spark - or is it?

Starting from your last question, "not sure if transferring data from one DB to another DB is a good use case for Spark":

It is, within the limitations of the Spark JDBC connector. There are some limitations, such as the missing support for updates (the connector can only append to or overwrite a target table, not update existing rows), and limited parallelism when reading a table (it requires splitting the table on a numeric column).
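
For illustration, here is a minimal sketch of a single-table copy using the Spark JDBC data source. The host names, credentials, table names, and the "id" column bounds are placeholder assumptions; the partition bounds and counts would need tuning per table:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("mssql-to-oracle").getOrCreate()

    // Read one SQL Server table, split into 8 parallel partitions on a numeric column.
    // partitionColumn/lowerBound/upperBound/numPartitions must be given together.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://mssql-host:1433;databaseName=src_db")
      .option("dbtable", "dbo.orders")
      .option("user", "etl_user")
      .option("password", sys.env("MSSQL_PASSWORD"))
      .option("partitionColumn", "id")   // assumed numeric surrogate key
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "8")
      .load()

    // Write to Oracle; only append/overwrite save modes are available, not updates.
    df.write
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
      .option("dbtable", "TARGET_SCHEMA.ORDERS")
      .option("user", "etl_user")
      .option("password", sys.env("ORACLE_PASSWORD"))
      .mode("overwrite")
      .save()

The JDBC drivers for both databases would have to be on the classpath (e.g. passed via --jars to spark-submit).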

Considering the I/O cost and the limited throughput of an RDBMS, running the jobs one after another in FIFO mode does not sound like a good idea: a single table copy is unlikely to saturate the cluster. Instead, you can submit each job with a configuration that requests 1/x of the cluster's resources, so that x tables are processed in parallel.
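
One way to get that parallelism from a single driver is to launch the per-table copies from multiple threads with the FAIR scheduler enabled (spark.scheduler.mode=FAIR), so concurrent jobs share the executors instead of queuing. A sketch, assuming the SparkSession from above and a hypothetical copyTable helper that wraps the JDBC read/write:

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.SparkSession

    // Hypothetical helper: one JDBC read from SQL Server plus write to Oracle,
    // as sketched above.
    def copyTable(spark: SparkSession, table: String): Unit = {
      // ... read/write logic per table ...
    }

    // Run up to 4 table copies concurrently from the driver.
    val parallelism = 4
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(parallelism))

    val tables = Seq("dbo.orders", "dbo.customers", "dbo.products")

    val jobs: Seq[Future[Unit]] = tables.map { table =>
      Future {
        // Jobs submitted from different threads land in the "etl" pool and
        // share cluster resources under the FAIR scheduler.
        spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl")
        copyTable(spark, table)
      }
    }

    Await.result(Future.sequence(jobs), Duration.Inf)

Alternatively, each table (or group of tables) can be its own spark-submit sized to roughly 1/x of the cluster via --executor-cores and --executor-memory; on EMR, autoscaling of the cluster would then address the "elasticity" concern from the question.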
