How to dynamically initialize tasks from another task in Airflow?

Question

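I am currently working on a DAG that performs the same tasks for different datasets within a single DAG definition. The list of datasets and their schemas needs to be read from some configuration. I have ideas about where to store the configuration, but I can't figure out how to read it and then start iterating over tasks based on its contents.

My code currently looks like this: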

from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Configuration read logic will be implemented here, working with Variable during test phase.
def _read_config(ti):
    Variable.set("table_list", ["first", "second", "third"], serialize_json=True)

# The actual processing logic will be implemented here.
def _processing(table):
    print("Current table:")
    print(table)

# Builds one processing operator for the given table.
def processing(table):
    return PythonOperator(
        task_id=f"processing_{table}",
        python_callable=_processing,
        op_kwargs={
            "table": table
        }
    )

# Callable of the scheduling task: iterates over the stored table list.
# Operators created here are built while the task runs, not when the DAG is parsed.
def _scheduling():
    for table in Variable.get("table_list", deserialize_json=True):
        processing(table)

read_config = PythonOperator(
    task_id='read_config',
    python_callable=_read_config
)

scheduling = PythonOperator(
    task_id='scheduling',
    python_callable=_scheduling
)

read_config >> scheduling

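But this results in the following graph:

What I want to achieve is:

1. Read the configuration in a task.
2. Iterate over the result of this task in a main scheduler task (or any other alternative).
3. From this scheduling task (?), initialize instances of the processing task.

Is there a proper way to solve this in Airflow? I am open to new suggestions; the only thing that matters is that these 3 steps are carried out correctly.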

Answer 1

Generating tasks dynamically in this way is not recommended in Airflow. Instead, since version 2.3.0 Airflow provides a clean way to do this using Dynamic Task Mapping.

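from airflow import DAG
from airflow.decorators import task

# "dag_id" is a placeholder; DAG ids may not contain spaces.
with DAG(dag_id="dag_id", start_date=...) as dag:

    @task
    def read_config():
        # Load the config here from Variables, a database, an S3/GCS file, ...
        return ["first", "second", "third"]

    @task
    def processing(table: str):
        # Implement your processing function here.
        print(f"Current table: {table}")

    # expand() creates one mapped processing task instance per list element.
    config = read_config()
    processing.expand(table=config)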

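Airflow stores the config returned by the first task as an XCom; the second task then pulls it and runs one instance for each element of the list. In the UI you will see a single task with multiple mapped instances, but don't worry: if you have multiple workers, the instances will run in parallel on different workers (depending on your parallelism configuration).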
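
To make the fan-out behaviour described above more concrete, here is a rough plain-Python analogy. It is only a sketch and not how Airflow works internally: `run_mapped` and `max_parallel` are hypothetical names introduced for illustration, and a thread pool stands in for Airflow workers. The upstream function returns a list, and the downstream function is called once per element, with a cap on how many calls run at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


def read_config():
    # Stand-in for the upstream task; in Airflow this return value
    # would be stored as an XCom.
    return ["first", "second", "third"]


def processing(table):
    # Stand-in for a single mapped task instance.
    return f"Current table: {table}"


def run_mapped(func, items, max_parallel):
    # Hypothetical helper: one call per element, at most `max_parallel`
    # calls running concurrently. Results keep the order of `items`.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(func, items))


results = run_mapped(processing, read_config(), max_parallel=2)
for line in results:
    print(line)
```

Since each call receives exactly one element of the list, the elements do not have to be plain strings. If every dataset also needs its schema, the list could presumably hold dicts (for example with hypothetical `table` and `schema` keys) and the processing function would read both from the element it receives.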

Solution 1
0 2022-08-08 22:22:39
