
How to dynamically initialize tasks from another task in Airflow?

I am currently working on a DAG which performs the same tasks for different datasets within one DAG definition. The list of datasets and their schemas needs to be read from some configuration. I have ideas about where to store the configuration, but I can't figure out how to read it and then iterate over tasks based on its contents.

My code currently looks like this:

from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Configuration read logic will be implemented here, working with Variable during the test phase.
def _read_config(ti):
    Variable.set("table_list", ["first", "second", "third"], serialize_json=True)

# The actual processing logic will be implemented here.
def _processing(table):
    print("Current table:")
    print(table)

def processing(table):
    return PythonOperator(
        task_id=f"processing_{table}",
        python_callable=_processing,
        op_kwargs={
            "table": table
        }
    )

def _scheduling():
    for table in Variable.get("table_list", deserialize_json=True):
        processing(table)

read_config = PythonOperator(
    task_id='read_config',
    python_callable=_read_config
)

scheduling = PythonOperator(
    task_id='scheduling',
    python_callable=_scheduling
)


read_config >> scheduling

But that results in the following graph:

(screenshot of the resulting DAG graph)

What I want to achieve is this:

  1. Read the configuration in a task.
  2. Iterate over the result of this task in a main scheduler task (or any other alternative).
  3. From this scheduler task(?), initialize the instances of the processing task.

Is there a proper way to do this in Airflow? I am open to suggestions; the only important thing is to perform these three steps properly.

Generating tasks dynamically in this way is not recommended in Airflow. Instead, since 2.3.0 Airflow provides a clean way to do it using Dynamic Task Mapping.

from airflow import DAG
from airflow.decorators import task

with DAG(dag_id="my_dag", start_date=...) as dag:

    @task
    def read_config():
        # here load the config from variables, db, S3/GCS file, ...
        return ["first", "second", "third"]


    @task
    def processing(table: str):
        # implement your processing function
        print(f"Current table: {table}")

    config = read_config()
    processing.expand(table=config)

Airflow stores the config returned by the first task as an XCom; the second task pulls it and runs one mapped instance per element in the list. In the UI you will see a single task with a number of mapped instances, but don't worry: if you have multiple workers, the mapped instances run in parallel on different workers (subject to your parallelism configuration).
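For completeness, here is a minimal, self-contained sketch of the same idea. The DAG id, schedule, and dates are my own assumptions, and it reads the table list from the same Airflow Variable used in the question instead of hard-coding it; max_active_tis_per_dag is one way to cap how many mapped instances run at the same time:

from airflow import DAG
from airflow.decorators import task
from airflow.models import Variable
import pendulum

with DAG(
    dag_id="dynamic_tables_example",   # assumed name
    start_date=pendulum.datetime(2023, 1, 1),
    schedule=None,                     # Airflow 2.4+; use schedule_interval=None on 2.3
    catchup=False,
) as dag:

    @task
    def read_config():
        # Pull the JSON-serialized table list stored as an Airflow Variable.
        return Variable.get("table_list", deserialize_json=True)

    @task(max_active_tis_per_dag=4)    # optional: cap concurrent mapped instances
    def processing(table: str):
        print(f"Current table: {table}")

    # expand() creates one mapped task instance per table returned by read_config.
    processing.expand(table=read_config())

Note that the dependency between read_config and processing is created implicitly by passing the output of read_config() into expand(), so no explicit >> is needed.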
