
How to set priority across different DAGs in Airflow

Let's say we have two DAGs, dag1 and dag2, that serve different business requirements. They are completely unrelated, but it is more important that dag1 finishes as early as possible.
For simplicity, both DAGs have only one task and run daily.

In a scenario where dag1 is behind schedule by 2 or 3 days, I want to make sure that dag1 runs and completes its dag_runs first, i.e. dag1 is brought up to date before dag2 is able to proceed.

I tried priority_weight, but it doesn't work across different DAGs.

I need a way of putting the tasks from both DAGs in the same queue and achieving DAG-level prioritization.

From the official documentation for the ExternalTaskSensor:

Waits for a different DAG or a task in a different DAG to complete for
a specific execution_date.

    :param external_dag_id: The dag_id that contains the task you want to
        wait for
    :type external_dag_id: str
    :param external_task_id: The task_id that contains the task you want to
        wait for. If ``None`` the sensor waits for the DAG
    :type external_task_id: str
    :param allowed_states: list of allowed states, default is ``['success']``
    :type allowed_states: list
    :param execution_delta: time difference with the previous execution to
        look at, the default is the same execution_date as the current task or DAG.
        For yesterday, use [positive!] datetime.timedelta(days=1). Either
        execution_delta or execution_date_fn can be passed to
        ExternalTaskSensor, but not both.
    :type execution_delta: datetime.timedelta
    :param execution_date_fn: function that receives the current execution date
        and returns the desired execution dates to query. Either execution_delta
        or execution_date_fn can be passed to ExternalTaskSensor, but not both.
    :type execution_date_fn: callable
    :param check_existence: Set to `True` to check if the external task exists (when
        external_task_id is not None) or check if the DAG to wait for exists (when
        external_task_id is None), and immediately cease waiting if the external task
        or DAG does not exist (default value: False).
    :type check_existence: bool

Both DAGs should have depends_on_past set to True (it is a task-level argument, not a trigger rule) so that a newer scheduled DAG run will only execute once the previous scheduled run has completed successfully.

Then add the ExternalTaskSensor at the beginning of dag2 (the one which executes later).
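Putting the two suggestions together, a minimal sketch of dag2 might look like this (dag_id, task_id, and schedule values are illustrative placeholders; both DAGs are assumed to run on the same daily schedule, so no execution_delta is needed):

```python
# Sketch: dag2 gated on dag1 via ExternalTaskSensor (Airflow 2.x import paths).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="dag2",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args={"depends_on_past": True},  # newer runs wait for older ones
    catchup=True,
) as dag:
    # Blocks until dag1's run for the same execution_date has succeeded.
    wait_for_dag1 = ExternalTaskSensor(
        task_id="wait_for_dag1",
        external_dag_id="dag1",
        external_task_id=None,        # None => wait for the whole DAG
        allowed_states=["success"],
        mode="reschedule",            # free the worker slot while waiting
    )

    business_task = BashOperator(task_id="task1", bash_command="echo run")

    wait_for_dag1 >> business_task
```

With depends_on_past on dag1 as well, dag1's backlog runs complete in order, and each dag2 run only starts after the matching dag1 run has succeeded.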

Alternatively, you could create your own custom sensor and use it via Airflow Plugins in order to check the metadatabase for the status of DAG runs.

You could also build custom sensors that utilise either Airflow XComs or Airflow Variables to pass execution run times, or any other Airflow macro, to a sensor in dag2.
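A custom sensor along these lines might query the DagRun model directly. The class below is a hypothetical sketch, not a built-in, and assumes Airflow 2.x; it treats "dag1 has no running dag_runs" as the signal that dag1 has caught up:

```python
# Hypothetical custom sensor polling the metadatabase via the DagRun model.
from airflow.models import DagRun
from airflow.sensors.base import BaseSensorOperator
from airflow.utils.state import State


class DagRunCaughtUpSensor(BaseSensorOperator):
    """Succeeds once the target DAG has no running dag_runs left."""

    def __init__(self, target_dag_id: str, **kwargs):
        super().__init__(**kwargs)
        self.target_dag_id = target_dag_id

    def poke(self, context) -> bool:
        running = DagRun.find(dag_id=self.target_dag_id, state=State.RUNNING)
        return len(running) == 0  # dag1 has caught up; let dag2 proceed
```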

I found an ad-hoc solution where I simply wrap both DAGs within a locking layer.

By that I mean, we add a simple task at the start of each DAG that locks a specific row in a database, and a simple task at the end that unlocks it.
Therefore, when one of the two DAGs is currently executing and the other wants to start, the latter is simply blocked because it cannot lock the specific row that is shared by both DAGs.

Below is a simple description of the locking layer:
dag1: lock_operator, task1, unlock_operator
dag2: lock_operator, task1, unlock_operator

Of course, we can let the lock_operator fail if it cannot lock the row and set its retry count very high, so that it is guaranteed to keep retrying until it acquires the lock.
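The lock/unlock logic can be sketched with an atomic conditional UPDATE on a shared row (shown here with sqlite3 for a self-contained example; in a real deployment this would target a shared database, with these functions wrapped in PythonOperators):

```python
import sqlite3

LOCK_NAME = "dag_mutex"  # single row shared by dag1 and dag2


def init_lock(conn):
    # Create the lock row once; "held = 0" means the lock is free.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS locks (name TEXT PRIMARY KEY, held INTEGER)"
    )
    conn.execute("INSERT OR IGNORE INTO locks VALUES (?, 0)", (LOCK_NAME,))
    conn.commit()


def try_lock(conn) -> bool:
    # Atomic conditional update: only succeeds if the row is not held.
    cur = conn.execute(
        "UPDATE locks SET held = 1 WHERE name = ? AND held = 0", (LOCK_NAME,)
    )
    conn.commit()
    return cur.rowcount == 1  # True -> we acquired the lock


def unlock(conn):
    conn.execute("UPDATE locks SET held = 0 WHERE name = ?", (LOCK_NAME,))
    conn.commit()
```

The lock_operator would raise when try_lock returns False (so Airflow's retry mechanism keeps attempting), and the unlock_operator should run with trigger_rule="all_done" so the row is freed even if task1 fails.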

Your question is a bit vague: what you need is a dependency-management method, for which, for example, an ExternalTaskSensor will do the trick; but what you want is queue management with priority, making one DAG (or a group of DAGs) special so that it gets the hardware resources first when tasks enter a queue, e.g. in FIFO order.

For priority, you first need to use one of Airflow's queueing architectures (i.e. parallel processing plus a 3rd-party app for queue management), such as RabbitMQ+Celery or Redis+Celery. Then create different queues, assign your group A DAGs to queue1 and your group B DAGs to queue2, and later adjust the resource planning for each queue in the settings.
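With the CeleryExecutor, the assignment is a per-task `queue` argument; a minimal sketch (queue names and dag_ids are illustrative):

```python
# Sketch: routing each DAG's tasks to a dedicated Celery queue.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dag1", start_date=datetime(2024, 1, 1), schedule_interval="@daily"
) as dag1:
    BashOperator(task_id="task1", bash_command="echo important", queue="queue1")

with DAG(
    dag_id="dag2", start_date=datetime(2024, 1, 1), schedule_interval="@daily"
) as dag2:
    BashOperator(task_id="task1", bash_command="echo normal", queue="queue2")
```

You would then start dedicated workers per queue (e.g. `airflow celery worker --queues queue1` with a higher concurrency than the queue2 worker), so the important queue always has capacity.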

https://www.rabbitmq.com/priority.html

Good luck


 