
How to create airflow task dynamically

We have 100+ Airflow DAGs which run daily on their scheduled time. Upon failure, each DAG alerts via email. Our business team wants a notification when all 100+ DAGs have completed successfully for the day, so they can analyze the report data.

One way is to create a DAG that monitors all these 100+ DAGs and, upon success, triggers an email to the business team.

The problem with this approach is that we would need 100+ ExternalTaskSensor operators to monitor all these DAGs. From a maintenance point of view this does not scale well, since the number of DAGs keeps increasing.

We know that creating tasks dynamically is possible, as per How to dynamically create tasks in airflow.

But how do we iterate over 100+ values (DAG ids) stored in Airflow Variables (e.g., set via the CLI) inside a DAG?
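
For reference, here is a minimal sketch of the Variable-based iteration the question describes; the Variable name dag_ids_to_monitor and its JSON-list format are assumptions for illustration, not part of the original setup:

from airflow.models import Variable

# assumes the Variable was created beforehand, e.g.:
#   airflow variables set dag_ids_to_monitor '["dag_a", "dag_b"]'
# deserialize_json=True parses the stored JSON string into a Python list
dag_ids = Variable.get("dag_ids_to_monitor", deserialize_json=True, default_var=[])
for dag_id in dag_ids:
    print(dag_id)  # e.g. build one ExternalTaskSensor per dag_id here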

The best solution is storing the DAG ids in a file accessible by Airflow. But if that is complicated, or you want a fully dynamic way to achieve it, you can add a new tag to_monitor to the DAGs you want to monitor.

Here is an example with 4 DAGs to monitor and 4 other DAGs:

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from datetime import datetime

for i in range(4):
    dag_id = f'monitor_dag_{i}'
    with DAG(
            dag_id=dag_id,
            start_date=datetime(2022, 8, 27),
            tags=["to_monitor"]  # the tag used to select DAGs for monitoring
    ) as dag:
        task = EmptyOperator(
            task_id='empty_task',
        )
        # register the DAG in the module namespace so the Airflow parser picks it up
        globals()[dag_id] = dag


for i in range(4):
    dag_id = f'no_monitor_dag_{i}'
    with DAG(
            dag_id=dag_id,
            start_date=datetime(2022, 8, 27),
    ) as dag:
        task = EmptyOperator(
            task_id='empty_task',
        )
        globals()[dag_id] = dag

Then you can use PostgresHook to access the Airflow metadata database (if you are using another database, you should replace this with its hook) and query the dag_tag table to get the ids of the DAGs that have the tag to_monitor. Finally, you can create your sensors automatically:

from airflow.models import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.sensors.external_task import ExternalTaskSensor
from datetime import datetime


def get_dags_ids_to_monitor():
    # PostgresHook() uses the default connection id (postgres_default);
    # point that connection at the Airflow metadata database
    postgres_hook = PostgresHook()
    dags_ids = postgres_hook.get_records(
        sql="SELECT dag_id FROM dag_tag WHERE name = 'to_monitor'"
    )
    # convert the list of one-element tuples into a list of strings
    return [dag_id[0] for dag_id in dags_ids]


with DAG(
    dag_id="monitor_dag",
    start_date=datetime(2022, 8, 27)
) as dag:

    dags_to_monitor = get_dags_ids_to_monitor()

    sensor_tasks = [
        ExternalTaskSensor(
            task_id=f"{external_dag_id}_external_tasks_sensor",
            external_task_id="empty_task",
            external_dag_id=external_dag_id
        )
        for external_dag_id in dags_to_monitor
    ]
    notify_task = EmptyOperator(task_id="send_notification")
    # fan-in: the notification runs only after every sensor has succeeded
    sensor_tasks >> notify_task

Here is the graph:

[image: graph view of monitor_dag]

For your daily DAGs, you just need to add an empty finish task at the end of each one, which the sensors can target (empty_task in my example), and create a new Airflow connection (postgres_default, the PostgresHook default) with the metadata database credentials, as sketched below.
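
A minimal sketch of that change, assuming a hypothetical daily DAG (daily_report_dag and report_task are placeholders for one of your real DAGs):

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from datetime import datetime

with DAG(
    dag_id="daily_report_dag",  # placeholder for one of your existing daily DAGs
    start_date=datetime(2022, 8, 27),
    schedule_interval="@daily",
) as dag:
    report_task = EmptyOperator(task_id="report_task")  # stands in for the real work
    # terminal task the ExternalTaskSensor waits on; the task_id must match
    # external_task_id="empty_task" in the monitor DAG
    finish = EmptyOperator(task_id="empty_task")
    report_task >> finish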

With this solution, Airflow will query the database on every dag_dir_list_interval parse and on every task run, but don't worry: the cost is similar to reading an Airflow Variable with Variable.get() inside a task instance.
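
If you are worried about parsing breaking when the database is briefly unreachable, one defensive variant of get_dags_ids_to_monitor could look like this (a sketch; the silent empty-list fallback is my assumption about the preferred failure mode, and failing loudly instead is equally defensible):

from airflow.providers.postgres.hooks.postgres import PostgresHook


def get_dags_ids_to_monitor():
    # fall back to an empty list so DAG parsing survives a temporary DB outage;
    # note this means the monitor DAG is parsed with no sensors during the outage
    try:
        postgres_hook = PostgresHook()
        records = postgres_hook.get_records(
            sql="SELECT dag_id FROM dag_tag WHERE name = 'to_monitor'"
        )
        return [row[0] for row in records]
    except Exception:
        return []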
