How to dynamically initialize tasks from another task in Airflow?

I am currently working on a DAG which performs the same tasks for different datasets within one DAG definition. The list of datasets and their schemas needs to be read from some configuration. I have ideas about where to store the configuration, but I can't figure out how to read it in a task and then iterate over tasks based on the result of that configuration.

My code currently looks like this:

from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Configuration read logic will be implemented here; working with Variable during the test phase.
def _read_config(ti):
    Variable.set("table_list", ["first", "second", "third"], serialize_json=True)

# The actual processing logic will be implemented here.
def _processing(table):
    print("Current table:")
    print(table)

def processing(table):
    return PythonOperator(
        task_id=f"processing_{table}",
        python_callable=_processing,
        op_kwargs={
            "table": table
        }
    )

def _scheduling():
    for table in Variable.get("table_list", deserialize_json=True):
        processing(table)

read_config = PythonOperator(
    task_id='read_config',
    python_callable=_read_config
)

scheduling = PythonOperator(
    task_id='scheduling',
    python_callable=_scheduling
)


read_config >> scheduling

But that results in the following graph:

[image: resulting DAG graph, with the processing tasks missing]

What I want to achieve is this:

  1. Read the configuration in a task.
  2. Iterate over the result of this task in a main scheduler task (or any other alternative).
  3. From this scheduler task(?), initialize the instances of the processing task.

Is there a proper way to do this in Airflow? I am open to suggestions; the only important thing is to perform these three steps properly.

Generating tasks dynamically in this way is not recommended in Airflow. Instead, since 2.3.0 Airflow provides a clean way to do it using Dynamic Task Mapping.

from airflow import DAG
from airflow.decorators import task

with DAG(dag_id="dag id", start_date=...) as dag:

    @task
    def read_config():
        # here load the config from variables, db, S3/GCS file, ...
        return ["first", "second", "third"]


    @task
    def processing(table: str):
        # implement your processing function
        print(f"Current table: {table}")

    config = read_config()
    processing.expand(table=config)

Airflow will store the config returned by the first task as an XCom; the second task will pull it and run one instance for each element in the list. In the UI you will see a single task with a number of mapped instances. Don't worry: if you have multiple workers, the instances will be executed in parallel on different workers (depending on your parallelism configuration).
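If you want to load the list from an Airflow Variable as in your attempt, pass a fixed argument alongside the mapped one, or cap how many mapped instances run at once, here is a minimal sketch (assuming Airflow 2.3+; the `schema` argument, the DAG id, and the concurrency limit of 2 are illustrative, not part of your setup):

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.models import Variable

with DAG(dag_id="dynamic_tables", start_date=datetime(2023, 1, 1)) as dag:

    @task
    def read_config():
        # Pull the table list from an Airflow Variable, as in the original attempt.
        return Variable.get("table_list", deserialize_json=True)

    # max_active_tis_per_dag caps how many mapped instances run concurrently.
    @task(max_active_tis_per_dag=2)
    def processing(table: str, schema: str):
        print(f"Current table: {schema}.{table}")

    # partial() fixes the non-mapped argument; expand() maps over the list.
    processing.partial(schema="public").expand(table=read_config())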
