Executing a Databricks Notebook with PySpark Code Using Apache Airflow

Question

I am using Airflow, Databricks, and PySpark. I would like to know whether it is possible to pass additional parameters when I execute a Databricks notebook through Airflow.

I have the following Python code in a notebook named MyETL:

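def main(table, columns):
    # `spark` is the SparkSession that Databricks provides in the notebook.
    spark.sql("CREATE TABLE {0} {1}".format(table, columns))
    print("Running my ETL!")

if __name__ == "__main__":
    # arg1 and arg2 stand for the values I want to receive from Airflow.
    main(arg1, arg2)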

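I want to define additional task parameters so that the Databricks notebook runs with more arguments: specifically, the name of the method to call and the parameters for that method. For example, when I register the task in the Airflow DAG: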

   notebook_task_params = {
        'new_cluster': new_cluster,
        'notebook_task': {
            'notebook_path': '/Users/airflow@example.com/MyETL',
            'method_name': 'main',
            'params': "[{'table': 'A'}, {'columns': ['a', 'b']}]"
        },
    }

I don't know whether this is possible, because I have not found any similar examples.

# Example of using the JSON parameter to initialize the operator.
notebook_task = DatabricksSubmitRunOperator(
    task_id='notebook_task',
    dag=dag,
    json=notebook_task_params)

In other words, I want to execute a notebook with parameters using Airflow. How can I do that?

Answer 1

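You can add method_name as one of the params as well, and then parse out your logic in the notebook.

However, the more common pattern here is to make sure the method is already installed on your cluster.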

params = "[{'table': 'A'}, {'columns': ['a', 'b']}]"
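
On the Airflow side, one way to send such values is through the `base_parameters` field of `notebook_task`. The sketch below is an illustration, not the answerer's code. It assumes the parameters are sent as a flat map of strings, so the column list is JSON-encoded. The helper name `build_notebook_task_params`, the key `method_name`, and the cluster values are hypothetical placeholders.

```python
import json


def build_notebook_task_params(new_cluster, notebook_path, method_name, table, columns):
    """Build the `json` payload for a notebook run (hypothetical helper).

    Notebook parameters go in `base_parameters` as string keys and string
    values, so the list of columns is serialized with json.dumps.
    """
    return {
        "new_cluster": new_cluster,
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": {
                # Custom keys: only the notebook itself gives them meaning.
                "method_name": method_name,
                "table": table,
                "columns": json.dumps(columns),
            },
        },
    }


# Placeholder cluster spec; substitute real values for your workspace.
example_cluster = {
    "spark_version": "<spark-version>",
    "node_type_id": "<node-type>",
    "num_workers": 1,
}

notebook_task_params = build_notebook_task_params(
    new_cluster=example_cluster,
    notebook_path="/Users/airflow@example.com/MyETL",
    method_name="main",
    table="A",
    columns=["a", "b"],
)
```

The resulting dictionary can be passed as `json=notebook_task_params` to `DatabricksSubmitRunOperator`, exactly as in the question.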

Then in the notebook on Databricks:

table = getArgument("table", "DefaultValue")
columns = getArgument("columns", "DefaultValue")

result = method(table, columns)

If you can see the parameters in the notebook job run (as in the image above), you also know that they are accessible through getArgument().
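
The suggestion to pass method_name and parse it in the notebook could look roughly like the sketch below. It is a sketch under assumptions, not a definitive implementation: the parameter keys (`method_name`, `table`, `columns`), the `ALLOWED_METHODS` registry, and `run_from_params` are hypothetical names, `columns` is assumed to arrive as a JSON string, and `main` here is a stand-in that only echoes its arguments instead of calling `spark.sql`.

```python
import json


def main(table, columns):
    # Stand-in for the ETL entry point; a real notebook would call spark.sql here.
    return {"table": table, "columns": columns}


# Only methods listed here can be invoked by name from Airflow.
ALLOWED_METHODS = {"main": main}


def run_from_params(get_param):
    """Look up the requested method and call it with the decoded parameters.

    `get_param` is any callable that maps a parameter name to its string
    value. In a Databricks notebook that could be `dbutils.widgets.get`
    or a small wrapper around `getArgument`.
    """
    method_name = get_param("method_name")
    if method_name not in ALLOWED_METHODS:
        raise ValueError("Unknown method: {0}".format(method_name))
    table = get_param("table")
    columns = json.loads(get_param("columns"))  # sent as a JSON string
    return ALLOWED_METHODS[method_name](table, columns)


# Local dry run: a plain dict stands in for the notebook parameters.
fake_params = {"method_name": "main", "table": "A", "columns": '["a", "b"]'}
result = run_from_params(fake_params.__getitem__)
```

Using an explicit registry instead of looking the name up dynamically keeps the set of callable methods under the notebook author's control.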

Answer 1
2, accepted, 2019-06-25 15:42:38
