
Complex Data Pipeline Migration Plan Question

Original post 2018-09-20 21:41:01 3 2 pyspark/ airflow/ amazon-emr/ luigi

My plan:

Move all data processing to Spark (preferably PySpark), with the final consumer-facing data going to Redshift only. Spark seems to connect well to all of our sources (DynamoDB, S3, Redshift), and it can write to Redshift, S3, etc. depending on customer need. This avoids the current setup: multiple Redshift clusters, broken or overused unsupported internal ETL tools, and copies of the same data, views and tables across clusters.
Use Luigi to provide a web UI for monitoring pipelines daily, visualising the dependency tree, and scheduling ETLs. Email notifications on failure should also be an option. An alternative is AWS Data Pipeline, but Luigi seems to have a better UI for seeing what is happening when many dependencies are involved (some trees are 5 levels deep, though perhaps this can be avoided with better Spark code).

Questions:

Does Luigi integrate with Spark? (I have only used PySpark before, not Luigi, so this is a learning curve for me.) The plan was to schedule 'applications', and I believe Spark has its own ETL capabilities too, so I am unsure how Luigi fits in here.
How do I account for the fact that some pipelines may be 'real time'? Would I need to spin up the Spark/EMR job hourly, for example?

I'm open to thoughts / suggestions / better ways of doing this too!

2 answers

To answer your questions directly,

1) Yes, Luigi plays nicely with PySpark, just like any other library. We certainly have it running without issue. The only caveat is that you have to be a little careful with imports: keep them inside the methods of the Luigi task class because, in the background, Luigi spins up new Python instances.

2) There are ways of getting Luigi to consume streams of data, but it is tricky to do. Realistically, you'd fall back to running an hourly cron cycle that just calls the pipeline and processes any new data. This reflects Spotify's use case for Luigi, where they run daily jobs to calculate top artists, etc.
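A rough sketch of the hourly approach, assuming each cron run processes the last complete hour of data. The bucket name, the prefix layout and the script name in the crontab comment are all hypothetical:

```python
import datetime as dt

# Hypothetical crontab entry, five minutes past every hour:
#   5 * * * * python hourly_pipeline.py


def previous_hour_window(now):
    """Return (start, end) of the last complete hour before `now`."""
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - dt.timedelta(hours=1)
    return start, end


def s3_prefix_for(start, bucket="example-bucket"):
    # Assumed layout: one S3 prefix per hour of incoming data.
    return f"s3://{bucket}/events/{start:%Y/%m/%d/%H}/"


if __name__ == "__main__":
    start, end = previous_hour_window(dt.datetime.now(dt.timezone.utc))
    # In a real run this prefix would be handed to the pipeline as a parameter.
    print(s3_prefix_for(start))
```

Keying each run on a fixed hour window, rather than on "whatever is new", means a failed hour can be re-run later without picking up data from other hours.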

As @RonD suggests, if I were building a new pipeline now, I'd skip Luigi and go straight to Airflow. If nothing else, look at the release history: Luigi hasn't been significantly worked on for a long time (because it works for the main dev), whereas Airflow is actively being incubated by Apache.

Instead of Luigi, use Apache Airflow for workflow orchestration (the code is written in Python). It has a lot of built-in operators and hooks that you can call in DAGs (workflows). For example, create one task that calls an operator to start an EMR cluster, another to run a PySpark script located in S3 on the cluster, and another to watch the run's status. You can also use tasks to set up dependencies.
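As an illustration of the "run a PySpark script located in S3" task, here is a minimal sketch of the EMR step definition such a task would submit. The helper name, step name and S3 path are hypothetical. The dictionary follows the step format of the EMR API, and Airflow's EMR add-steps operator takes a list of such dictionaries as its steps argument, though the import path for that operator depends on your Airflow version:

```python
def pyspark_step(name, script_uri, script_args=()):
    """Build one EMR step that runs a PySpark script via spark-submit."""
    return {
        "Name": name,
        # Keep the cluster alive if the step fails, so it can be inspected.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets EMR run an arbitrary command as a step.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                script_uri,
                *script_args,
            ],
        },
    }


# Hypothetical script location and argument.
steps = [pyspark_step("daily_etl", "s3://example-bucket/jobs/etl.py", ["2018-09-20"])]
```

Building the step in a plain function keeps it easy to unit test separately from the DAG definition.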





solution1
1 2018-09-21 16:38:10

solution2
-1 2018-09-20 23:50:22
