
Complex Data Pipeline Migration Plan Question

My plan:

  1. Move all data processing to Spark (preferably PySpark), with final consumer-facing output going to Redshift only. Spark seems to connect well to all the various sources (DynamoDB, S3, Redshift), and it can write to Redshift, S3, etc. depending on customer need. This avoids the problems of the current setup: multiple Redshift clusters, broken or overused unsupported internal ETL tools, and copies of the same data, views, and tables across clusters.
  2. Use Luigi to provide a web UI for daily pipeline monitoring, visualise the dependency tree, and schedule ETL jobs. Email notifications on failure should also be an option. An alternative is AWS Data Pipeline, but Luigi seems to have a better UI for seeing what is happening where many dependencies are involved (some trees are 5 levels deep, though perhaps this can be avoided with better Spark code).

Questions:

  1. Does Luigi integrate with Spark? (I have only used PySpark before, not Luigi, so this is a learning curve for me.) The plan was to schedule 'applications', and I believe Spark has its own ETL capabilities too, so I am unsure how Luigi fits in here.
  2. How do I account for the fact that some pipelines may be 'real time'? Would I need to spin up the Spark / EMR job hourly, for example?

I'm open to thoughts / suggestions / better ways of doing this too!

To answer your questions directly,

1) Yes, Luigi plays nicely with PySpark, just like any other library. We certainly have it running without issue. The only caveat is that you have to be a little careful with imports: keep them inside the methods of the Luigi task class because, in the background, Luigi is spinning up new Python instances.
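To make the import caveat concrete, here is a minimal sketch of the layout. It uses a plain class as a stand-in for a `luigi.Task` subclass so that it runs without Luigi installed; the class name, the `input_path` attribute, and the bucket path are all hypothetical. The same placement of the import would apply to the `run()` method of a real Luigi task.

```python
import sys


class CountEventsTask:
    """Plain stand-in for a luigi.Task subclass (hypothetical example)."""

    def __init__(self, input_path):
        self.input_path = input_path

    def run(self):
        # Deferred import: PySpark is loaded only in the process that
        # actually executes the task, not when this module is imported.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("count-events").getOrCreate()
        try:
            # Count the JSON records found under the input path.
            return spark.read.json(self.input_path).count()
        finally:
            spark.stop()


task = CountEventsTask("s3://example-bucket/events/")  # hypothetical path
# Defining and instantiating the task has not touched PySpark yet.
print("pyspark" in sys.modules)
```

Because nothing at module level refers to PySpark, the scheduler can import the task definitions without needing a Spark installation; only the process that calls `run()` does.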

2) There are ways of getting Luigi to ingest streams of data, but it is tricky to do. Realistically, you'd fall back to running an hourly cron job that just calls the pipeline and processes any new data. This roughly reflects Spotify's use case for Luigi, where they run daily jobs to calculate top artists, etc.
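One way to picture the hourly fallback: each cron invocation works out the last fully elapsed hour and processes only that slice. The sketch below is an assumption about how such a script might be laid out, not something from the setup described here; the function names, the script name in the comment, and the `s3://example-bucket/events` layout are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def previous_hour_window(now):
    """Return (start, end) of the last fully elapsed hour before `now`."""
    end = now.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=1), end


def hour_prefix(start, base="s3://example-bucket/events"):
    """Build the (hypothetical) input prefix for one hourly partition."""
    return f"{base}/{start:%Y/%m/%d/%H}/"


if __name__ == "__main__":
    # Example crontab entry (assumed script name):
    #   5 * * * * python run_hourly.py
    start, end = previous_hour_window(datetime.now(timezone.utc))
    print(f"processing {hour_prefix(start)} covering {start} to {end}")
```

Deriving the window from the clock rather than from "whatever is new" keeps each run repeatable: rerunning a failed hour means calling the same function with that hour's timestamp.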

As @RonD suggests, if I were building a new pipeline now, I'd skip Luigi and go straight to Airflow. If nothing else, look at the release history: Luigi hasn't been significantly worked on for a long time (because it works for the main dev), whereas Airflow is actively being incubated by Apache.

Instead of Luigi, use Apache Airflow for workflow orchestration (the code is written in Python). It has a lot of built-in operators and hooks that you can call in DAGs (workflows). For example, create one task that calls an operator to start up an EMR cluster, another to run a PySpark script located in S3 on that cluster, and another to watch the run's status. You can also use tasks to set up dependencies.
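The three-task example above boils down to a small dependency graph. The following is a toy model of that ordering using only Python's standard library; it is not Airflow's API, and the task names are illustrative rather than real operator names.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks it must wait for.
# Task names are illustrative and are not Airflow operator names.
dependencies = {
    "create_emr_cluster": set(),
    "submit_pyspark_step": {"create_emr_cluster"},
    "watch_step_status": {"submit_pyspark_step"},
}

# Resolve an execution order in which every task follows its prerequisites.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

In an Airflow DAG file the same chain is usually written with the `>>` operator between task objects, and the scheduler takes care of running each task only once its upstream tasks have succeeded.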


 