
How to dynamically initialize tasks from another task in Airflow?

I am currently working on a DAG which performs the same tasks for different datasets within one DAG definition. The list of datasets and their schemas needs to be read from some configuration. I have ideas about where to store the configuration, but I can't figure out how to read it and then iterate over tasks based on its contents.

My code currently looks like this:

from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Configuration read logic will be implemented here, working with Variable during the test phase.
def _read_config(ti):
    Variable.set("table_list", ["first", "second", "third"], serialize_json=True)

# The actual processing logic will be implemented here.
def _processing(table):
    print("Current table:")
    print(table)

def processing(table):
    return PythonOperator(
        task_id=f"processing_{table}",
        python_callable=_processing,
        op_kwargs={
            "table": table
        }
    )

def _scheduling():
    for table in Variable.get("table_list", deserialize_json=True):
        processing(table)

read_config = PythonOperator(
    task_id='read_config',
    python_callable=_read_config
)

scheduling = PythonOperator(
    task_id='scheduling',
    python_callable=_scheduling
)


read_config >> scheduling

But that results in the following graph:

(screenshot of the resulting DAG graph)

What I want to achieve is this:

  1. Read the configuration in a task.
  2. Iterate over the result of this task in a main scheduler task (or any other alternative).
  3. From this scheduler task(?), initialize the instances of the processing task.

Is there a proper way to do this in Airflow? I am open to suggestions; the only important thing is to perform these three steps properly.

Generating tasks dynamically in this way is not recommended in Airflow. Instead, since 2.3.0 Airflow provides a clean way to do it using Dynamic Task Mapping.

from airflow import DAG
from airflow.decorators import task

with DAG(dag_id="my_dag", start_date=...) as dag:

    @task
    def read_config():
        # here load the config from variables, db, S3/GCS file, ...
        return ["first", "second", "third"]


    @task
    def processing(table: str):
        # implement your processing function
        print(f"Current table: {table}")

    config = read_config()
    processing.expand(table=config)

Airflow stores the config returned by the first task as an XCom; the second task pulls it and runs one mapped instance per element in the list. In the UI you will see a single task with a number of mapped instances, but don't worry: if you have multiple workers, the mapped instances run in parallel on different workers (subject to your parallelism configuration).
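For completeness, here is a minimal, self-contained sketch of the same idea. The DAG id, schedule, and dates are my own assumptions, and it reads the table list from the same Airflow Variable used in the question instead of hard-coding it; max_active_tis_per_dag is one way to cap how many mapped instances run at the same time:

from airflow import DAG
from airflow.decorators import task
from airflow.models import Variable
import pendulum

with DAG(
    dag_id="dynamic_tables_example",   # assumed name
    start_date=pendulum.datetime(2023, 1, 1),
    schedule=None,                     # Airflow 2.4+; use schedule_interval=None on 2.3
    catchup=False,
) as dag:

    @task
    def read_config():
        # Pull the JSON-serialized table list stored as an Airflow Variable.
        return Variable.get("table_list", deserialize_json=True)

    @task(max_active_tis_per_dag=4)    # optional: cap concurrent mapped instances
    def processing(table: str):
        print(f"Current table: {table}")

    # expand() creates one mapped task instance per table returned by read_config.
    processing.expand(table=read_config())

Note that the dependency between read_config and processing is created implicitly by passing the output of read_config() into expand(), so no explicit >> is needed.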
