
What is the best way to structure a Spark Structured Streaming pipeline?

I'm moving data from my Postgres database to Kafka, doing some transformations with Spark in the middle. I have 50 tables, and each table's transformations are completely different from the others'. So I want to know the best way to structure my Spark Structured Streaming code. I'm considering three options:

  1. Put all the read and write logic for these 50 tables in one object and call only that object.

  2. Create 50 different objects, one per table, plus a driver object whose main method calls each of the 50 objects and then calls spark.streams.awaitAnyTermination() (see the sketch after this list).

  3. Submit each of these 50 objects individually via spark-submit.
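
For clarity, here is a minimal sketch of what I mean by option 2. The source format, topic name, broker address, and checkpoint path are just placeholders for this question, not my real settings:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// One object per table, each defining its own transformations.
object OrdersPipeline {
  def start(spark: SparkSession): Unit = {
    val source = spark.readStream
      .format("rate")                  // placeholder source; the real Postgres/CDC source goes here
      .load()

    val transformed: DataFrame = source
      .selectExpr("CAST(value AS STRING) AS value")  // table-specific transformations go here

    transformed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")    // placeholder broker
      .option("topic", "orders")                               // placeholder topic
      .option("checkpointLocation", "/tmp/checkpoints/orders") // placeholder path
      .start()
  }
}

// Driver object: start all 50 streams, then block until any of them terminates.
object AllPipelines {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("all-tables").getOrCreate()
    OrdersPipeline.start(spark)
    // ... start the other 49 table pipelines here ...
    spark.streams.awaitAnyTermination()
  }
}
```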

If there is a better option, please let me know.

Thank you

Creating a single object as per your approach 1 does not look good: it will be difficult to understand and maintain.

Between options 2 and 3, I would still prefer the 3rd. Having separate jobs is a bit of a hassle to maintain (managing deployments and factoring out the common code), but done well it gives you more flexibility. You can easily undeploy a single table if needed, and any subsequent change means deploying only the affected table's flow; the other existing table pipelines keep working as they are.
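
To illustrate how the common code could be factored out while still keeping one spark-submit per table, here is a rough sketch. The trait name, placeholder source, broker address, and checkpoint paths are my own assumptions for the example, not something from your setup:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Shared plumbing: reading the source and writing to Kafka looks the same for every
// table; only the transformation, topic, and checkpoint location differ.
trait TablePipeline {
  def tableName: String
  def transform(df: DataFrame): DataFrame

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName(s"stream-$tableName").getOrCreate()

    val source = spark.readStream
      .format("rate")                                        // placeholder; the real Postgres/CDC source goes here
      .load()

    transform(source).writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
      .option("topic", tableName)
      .option("checkpointLocation", s"/tmp/checkpoints/$tableName")
      .start()
      .awaitTermination()
  }
}

// One thin object per table; each is its own spark-submit entry point, e.g.
//   spark-submit --class CustomersPipeline my-streams.jar
object CustomersPipeline extends TablePipeline {
  val tableName = "customers"
  def transform(df: DataFrame): DataFrame =
    df.selectExpr("CAST(value AS STRING) AS value")          // table-specific logic goes here
}
```

With this layout a change to one table only requires rebuilding and resubmitting that table's job, while the shared trait keeps the 50 objects thin.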
