How to dynamically initialize tasks from another task in Airflow?

Question

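I am currently working on a DAG that performs the same tasks for different datasets within a single DAG definition. The list of datasets and their schemas needs to be read from some configuration. I have ideas about where to store the config, but I can't figure out how to read the config and then iterate over tasks based on the result.

My code currently looks like this: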

from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Configuration read logic will be implemented here, working with Variable during test phase.
def _read_config(ti):
    Variable.set("table_list", ["first", "second", "third"], serialize_json=True)

# The actual processing logic will be implemented here.
def _processing(table):
    print("Current table:")
    print(table)

# Builds one processing operator for the given table.
def processing(table):
    return PythonOperator(
        task_id=f"processing_{table}",
        python_callable=_processing,
        op_kwargs={
            "table": table
        }
    )

# Intended to create one processing task per configured table.
def _scheduling():
    for table in Variable.get("table_list", deserialize_json=True):
        processing(table)

read_config = PythonOperator(
    task_id='read_config',
    python_callable=_read_config
)

scheduling = PythonOperator(
    task_id='scheduling',
    python_callable=_scheduling
)

read_config >> scheduling

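But this results in the following graph:

What I want to achieve is:

Read the config in a task.
Iterate over the result of this task in a main scheduler task (or any other alternative).
From this scheduler task (?), initialize the instances of the processing task.

Is there a proper way to do this in Airflow? I'm open to new suggestions; the only thing that matters is that these 3 steps are performed properly.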

Answer 1

Generating tasks dynamically in this way is not recommended in Airflow. Instead, since version 2.3.0, Airflow provides a clean way to do this using Dynamic Task Mapping.

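from airflow import DAG
from airflow.decorators import task

# "dag_id" is a placeholder; DAG ids may not contain spaces.
with DAG(dag_id="dag_id", start_date=...) as dag:

    @task
    def read_config():
        # here load the config from variables, db, S3/GCS file, ...
        return ["first", "second", "third"]

    @task
    def processing(table: str):
        # implement your processing function
        print(f"Current table: {table}")

    config = read_config()
    # Creates one mapped instance of processing per element of the returned list.
    processing.expand(table=config)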

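Airflow stores the config returned by the first task as an XCom; the second task then pulls it and runs one instance for each element of the list. In the UI you will see a single task with multiple mapped instances, but don't worry: if you have multiple workers, the instances will run in parallel on different workers (depending on your parallelism configuration).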
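To make the fan-out concrete, here is a plain-Python model of what mapping does at run time. It is only a sketch and not Airflow code: the helper `run_mapped` and the sample config entries with a `schema` field are hypothetical, chosen because the question mentions that each dataset comes with a schema.

```python
def read_config():
    # Hypothetical config: each entry carries a table name and its schema.
    return [
        {"table": "first", "schema": "raw"},
        {"table": "second", "schema": "raw"},
        {"table": "third", "schema": "staging"},
    ]


def processing(dataset):
    # Stand-in for the real processing logic.
    return f"{dataset['schema']}.{dataset['table']}"


def run_mapped(func, items):
    # One call per element, keyed by its position in the list, mirroring
    # how each mapped task instance gets its own map index.
    return {map_index: func(item) for map_index, item in enumerate(items)}


# The list is only known once read_config() has actually run.
results = run_mapped(processing, read_config())
print(results)
```

In the DAG above, the corresponding change would presumably be to return the list of dicts from read_config and call processing.expand(dataset=config), so that each mapped instance receives one dict.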

Solution 1
0 2022-08-08 22:22:39