
Airflow ExternalTaskSensor gets stuck

I'm trying to use ExternalTaskSensor, but it gets stuck poking another DAG's task that has already completed successfully.

Here, a first DAG "a" completes its task, after which a second DAG "b" is supposed to be triggered via an ExternalTaskSensor. Instead, it gets stuck poking for a.first_task.

First DAG:

import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id='a',
    default_args={'owner': 'airflow', 'start_date': datetime.datetime.now()},
    schedule_interval=None
)

def do_first_task():
    print('First task is done')

PythonOperator(
    task_id='first_task',
    python_callable=do_first_task,
    dag=dag)

Second DAG:

import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import ExternalTaskSensor

dag = DAG(
    dag_id='b',
    default_args={'owner': 'airflow', 'start_date': datetime.datetime.now()},
    schedule_interval=None
)

def do_second_task():
    print('Second task is done')

ExternalTaskSensor(
    task_id='wait_for_the_first_task_to_be_completed',
    external_dag_id='a',
    external_task_id='first_task',
    dag=dag) >> \
PythonOperator(
    task_id='second_task',
    python_callable=do_second_task,
    dag=dag)

What am I missing here?

ExternalTaskSensor assumes that you are depending on a task in a DAG run with the same execution date.

This means that in your case DAGs a and b need to run on the same schedule (e.g. every day at 9:00 am).

Otherwise you need to use execution_delta or execution_date_fn when you instantiate the ExternalTaskSensor.

Here is the documentation from inside the operator itself, which helps clarify further:

:param execution_delta: time difference with the previous execution to
    look at, the default is the same execution_date as the current task.
    For yesterday, use [positive!] datetime.timedelta(days=1). Either
    execution_delta or execution_date_fn can be passed to
    ExternalTaskSensor, but not both.

:type execution_delta: datetime.timedelta


:param execution_date_fn: function that receives the current execution date
    and returns the desired execution date to query. Either execution_delta
    or execution_date_fn can be passed to ExternalTaskSensor, but not both.

:type execution_date_fn: callable

To clarify something I've seen here and on other related questions: the DAGs don't necessarily have to run on the same schedule, as stated in the accepted answer. The DAGs also don't need to have the same start_date. If you create your ExternalTaskSensor task without execution_delta or execution_date_fn, then the two DAGs need to have the same execution date. It so happens that if two DAGs have the same schedule, the scheduled runs in each interval will have the same execution date. I'm not sure what the execution date would be for manually triggered runs of scheduled DAGs.

For this example to work, DAG b's ExternalTaskSensor task needs an execution_delta or execution_date_fn parameter. If using execution_delta, it should satisfy: b's execution date - execution_delta = a's execution date. If using execution_date_fn, the function should return a's execution date.
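As a concrete illustration (the schedules here are hypothetical: suppose DAG "a" runs daily at 09:00 and DAG "b" daily at 10:00, so b's execution date is always one hour after a's), the mapping the sensor needs can be written either way:

```python
from datetime import datetime, timedelta

# Hypothetical schedules: DAG "a" daily at 09:00, DAG "b" daily at 10:00.
# Option 1: pass execution_delta=timedelta(hours=1) to the ExternalTaskSensor.
# Option 2: pass an execution_date_fn like this one:
def a_execution_date(b_execution_date):
    """Map b's execution date to the matching run of DAG "a"."""
    return b_execution_date - timedelta(hours=1)

print(a_execution_date(datetime(2020, 1, 2, 10, 0)))  # 2020-01-02 09:00:00
```

Both options express the same relationship; execution_delta is simpler when the offset is constant, while execution_date_fn allows arbitrary logic.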

If you were using the TriggerDagRunOperator, and then an ExternalTaskSensor to detect when that DAG completed, you can pass the main DAG's execution date to the triggered one with the TriggerDagRunOperator's execution_date parameter, e.g. execution_date='{{ execution_date }}'. Then the execution date of both DAGs is the same, and you don't need matching schedules or the execution_delta / execution_date_fn sensor parameters.
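A minimal sketch of that pattern (the DAG id and schedule here are hypothetical; on Airflow 1.10.x the operator lives in airflow.operators.dagrun_operator):

```python
import datetime
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator

dag = DAG(
    dag_id='main_dag',  # hypothetical parent DAG
    default_args={'owner': 'airflow',
                  'start_date': datetime.datetime(2020, 1, 1)},
    schedule_interval='@daily',
)

TriggerDagRunOperator(
    task_id='trigger_b',
    trigger_dag_id='b',
    # Forward the parent's execution date, so both DAG runs share it and
    # the ExternalTaskSensor in "b" needs no execution_delta/execution_date_fn.
    execution_date='{{ execution_date }}',
    dag=dag,
)
```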

The above was written and tested on Airflow 1.10.9.

As of Airflow v1.10.7, tomcm's answer is not true (at least for this version). One should use execution_delta or execution_date_fn to determine the date and schedule of the external DAG if the two DAGs do not have the same schedule.

From my successful case:

from datetime import timedelta
import pendulum

default_args = {
    'owner': 'xx',
    'retries': 2,
    'email': ALERT_EMAIL_ADDRESSES,
    'email_on_failure': True,
    'email_on_retry': False,
    'retry_delay': timedelta(seconds=30),
    # avoid stopping tasks after one day
    'depends_on_past': False,
}

dag = DAG(
    dag_id = dag_id,
    # get the datetime type value
    start_date = pendulum.strptime(current_date, "%Y, %m, %d, %H").astimezone('Europe/London').subtract(hours=1),
    description = 'xxx',
    default_args = default_args,
    schedule_interval = timedelta(hours=1),
    )
...
    external_sensor= ExternalTaskSensor(
            task_id='ext_sensor_task_update_model',
            external_dag_id='xxx',
            external_task_id='xxx'.format(log_type),
            # set the task_id to None because of the end_task
            # external_task_id = None,
            dag=dag,
            timeout = 300,
            )
...

You can wait until the tasks are triggered automatically by the scheduler. Don't trigger them manually; the start_date will be different.

Airflow by default looks for the same execution date timestamp. And if we use the execution_date_fn parameter, we have to return a list of timestamp values to look for. Internally, the sensor queries Airflow's task_instance table to check for DAG runs matching the dag_id, task_id, state, and execution-date timestamp provided as arguments. So if we use a None schedule, the DAG has to be triggered manually, and in that case the execution-date timestamp could be any possible value. I have explained it in detail here: https://link.medium.com/QzXm21asokb
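A toy model of that matching query (an illustration only, not Airflow's actual implementation) shows why an exact-timestamp comparison makes the sensor hang on manually triggered runs:

```python
from datetime import datetime

# Pretend task_instance table: (dag_id, task_id, state, execution_date) rows.
task_instances = [
    ('a', 'first_task', 'success', datetime(2020, 1, 2, 9, 0, 0)),
]

def poke(dag_id, task_id, states, dttm_filter):
    """Succeed only if every requested execution-date timestamp has a
    matching row in the allowed states (exact timestamp equality)."""
    matches = sum(
        1 for d, t, s, ed in task_instances
        if d == dag_id and t == task_id and s in states and ed in dttm_filter
    )
    return matches == len(dttm_filter)

# Same timestamp -> the sensor succeeds:
print(poke('a', 'first_task', {'success'}, [datetime(2020, 1, 2, 9, 0, 0)]))   # True
# A manual trigger pokes with its own (slightly different) timestamp -> stuck:
print(poke('a', 'first_task', {'success'}, [datetime(2020, 1, 2, 9, 0, 37)]))  # False
```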

I have created a new sensor inheriting from ExternalTaskSensor that can be used to monitor DAGs with a None schedule. You can find the code in this repo: https://github.com/Deepaksai1919/AirflowTaskSensor

I ran into this as well, but in my case both DAGs were using the same schedule_interval, so none of the above suggestions helped.

It turned out to be an Airflow bug. Templates in the external_task_id / external_task_ids fields are currently broken in v2.2.4: https://github.com/apache/airflow/issues/22782
