
ETL in Airflow aided by Jupyter Notebooks and Papermill

So my issue is that I build ETL pipelines in Airflow, but I really develop and test the Extract, Transform and Load functions in Jupyter notebooks first. So I end up copy-pasting back and forth all the time between my Airflow Python operator code and my Jupyter notebooks, which is pretty inefficient. My gut tells me that all of this can be automated.

Basically, I would like to write my Extract, Transform and Load functions in Jupyter and have them stay there, while still running the pipeline in Airflow and having the extract, transform and load tasks show up, with retries and all the good stuff that Airflow provides out of the box.

Papermill is able to parameterize notebooks, but I really can't think of how that would help in my case. Can someone please help me connect the dots?

[Disclaimer: I am one of the committers for the mentioned open source project.] We've created Elyra, a set of JupyterLab extensions, to streamline exactly this kind of work. We've just released version 2.1, which provides a visual editor that you can use to assemble pipelines from notebooks and Python scripts (R support should be available soon) and run them on Apache Airflow, Kubeflow Pipelines, or locally in JupyterLab. For Airflow (running on Kubernetes) we've created a custom operator that takes care of housekeeping and execution. I've written a summary article about it that you can find here, and we've got a couple of introductory tutorials if you are interested in trying this out.

A single master Jupyter notebook, with any number of slave notebooks (used as templates), executed in sequence using papermill.execute_notebook, should be sufficient to automate any ML pipeline.
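For illustration, a minimal sketch of what such a master could look like (the notebook file names and the parameter are placeholders):

import papermill as pm

# Hypothetical slave notebooks, executed in order by the master.
stages = ["extract.ipynb", "transform.ipynb", "load.ipynb"]

for stage in stages:
    pm.execute_notebook(
        input_path=stage,                       # template (slave) notebook
        output_path=f"runs/{stage}",            # executed copy, with cell outputs
        parameters={"run_date": "2021-01-01"},  # injected into the tagged parameters cell
    )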

To pass information between pipeline stages (from one slave notebook to the next one(s)), it's possible to use another Netflix package, scrapbook, which allows us to record Python objects in slave notebooks (as they are processed by papermill) and then to retrieve these objects from the slaves in the pipeline master (saving uses scrapbook.glue and reading uses scrapbook.read_notebook).
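A rough sketch of that hand-off (the scrap name and value are made up for the example):

import scrapbook as sb

# In a slave notebook (while it is being executed by papermill): record a value.
row_count = 12345                # hypothetical result computed in this stage
sb.glue("row_count", row_count)

# In the master notebook: read it back from the executed output notebook.
nb = sb.read_notebook("runs/extract.ipynb")
print(nb.scraps["row_count"].data)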

Resuming from any completed stage is also possible, but it requires storing the necessary inputs saved during previous stage(s) in a predictable place reachable from the master (e.g. in a local master JSON file or in MLflow).
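One simple way to do that, sketched below with an assumed local JSON checkpoint file (the file name and keys are arbitrary):

import json
from pathlib import Path
from typing import Optional

STATE_FILE = Path("pipeline_state.json")  # hypothetical checkpoint kept next to the master

def save_stage_output(stage: str, outputs: dict) -> None:
    # Merge a completed stage's outputs into the checkpoint file.
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[stage] = outputs
    STATE_FILE.write_text(json.dumps(state, indent=2))

def load_stage_output(stage: str) -> Optional[dict]:
    # Return a stage's saved outputs, or None if that stage has not completed yet.
    if not STATE_FILE.exists():
        return None
    return json.loads(STATE_FILE.read_text()).get(stage)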

The master notebook can also be scheduled with a cron job (e.g. from Kubernetes).

  • Alternatives

Airflow is probably overkill for most ML teams due to admin costs (5 containers, incl. 2 databases), while other (non-Netflix) Python packages would either require more boilerplate (Luigi) or extra privileges and custom Docker images for executors (Elyra), while Ploomber would expose you to a few-maintainers risk.

It is possible to use Jupyter Notebooks in your Airflow pipeline, as you suggest, via Papermill. However, one of the advantages of Airflow is that you can separate your pipeline into discrete steps that are independent of each other, so if you decide to write the whole pipeline in one Jupyter Notebook, that defeats the purpose of using Airflow.

So, assuming that each one of your discrete ETL steps lives in a separate Jupyter Notebook, you could try the following:

  1. Create one Jupyter Notebook for each step. For example, copy_data_from_s3, cleanup_data, load_into_database (3 steps, one notebook for each).
  2. Ensure that each notebook is parametrized per the Papermill instructions. This means adding a tag to each cell that declares variables that can be parametrized from outside (a sketch of such a tagged cell is shown after the operator example below).
  3. Ensure these notebooks are findable by Airflow (e.g. in the same folder where the DAG lives).
  4. Write functions that will use Papermill to parametrize and run your notebooks, one for each step. For example:
import papermill as pm
# ...
# define DAG, etc.
# ...

def copy_data_from_s3(**context):
    pm.execute_notebook(
        "copy_data_from_s3_step.ipynb",   # input notebook
        "copy_data_from_s3_step.ipynb",   # output notebook (here written over the input)
        parameters=dict(date=context['execution_date'])  # pass some context parameter if you need to
    )
  5. Finally, set up the step, perhaps as a PythonOperator (although you can also use a BashOperator if you want to run Papermill from the command line). To match the function from above:
copy_data = PythonOperator(dag=dag,
                           task_id='copy_data_task',
                           provide_context=True,
                           python_callable=copy_data_from_s3)
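
For step 2, this is roughly what the cell tagged "parameters" could look like in the hypothetical copy_data_from_s3_step.ipynb (Papermill injects a new cell right after it that overrides these defaults):

# Cell tagged "parameters" in copy_data_from_s3_step.ipynb
date = "2021-01-01"               # overridden with the DAG's execution_date at run time
s3_bucket = "my-example-bucket"   # hypothetical default, also overridable from the DAG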

Airflow has a papermill operator, but the development experience isn't great. One of the main issues with Python-based DAGs in Airflow is that they are executed in the same Python environment, which can cause dependency problems as soon as you have more than one DAG. See this for more details.
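For reference, a minimal sketch of how that operator is typically wired up (from the apache-airflow-providers-papermill package); the paths and the parameter are placeholders, and dag is assumed to be defined elsewhere:

from airflow.providers.papermill.operators.papermill import PapermillOperator

run_notebook = PapermillOperator(
    task_id="run_extract_notebook",
    input_nb="/opt/airflow/notebooks/extract.ipynb",            # template notebook
    output_nb="/opt/airflow/notebooks/extract_{{ ds }}.ipynb",  # executed copy per run
    parameters={"run_date": "{{ ds }}"},                        # Jinja-templated execution date
    dag=dag,
)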

If you are willing to give a new tool a try, I'd recommend you use Ploomber (disclaimer: I'm the author), which can orchestrate notebook-based pipelines (it uses papermill under the hood). You can develop locally and export to Kubernetes or Airflow.

If you want to know what a project in Ploomber looks like, feel free to take a look at the examples repository.

Why do you want the ETL jobs as Jupyter notebooks? What advantage do you see? Notebooks are generally meant for building a nice document with live data. ETL jobs are supposed to be scripts running in the background and automated.

Why can't these jobs be plain Python code instead of notebooks?

Also, when you run the notebook using PapermillOperator, the output of the run will be another notebook saved somewhere. It is not that friendly to keep checking these output files.

I would recommend writing the ETL job in plain Python and running it with PythonOperator. This is much simpler and easier to maintain.

If you want to use the notebook for its fancy features, that is a different thing.

Ploomber will be able to solve your problems! It actually supports parameterizing notebooks and deploying via Airflow/shell scripts/Slurm/Kubeflow/Argo. This also helps you define a modular pipeline instead of a monolithic notebook. It's pretty easy and straightforward to get started, and it gives WAY more flexibility than papermill. Check it out! https://github.com/ploomber/ploomber

"

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM