
Airflow Sensor - timeout

tl;dr, problem framing:

Assuming I have a sensor poking with timeout = 24*60*60. Since the connection does time out occasionally, retries must be allowed. If the sensor now retries, the timeout variable is applied afresh to every new try with the initial 24*60*60, and therefore the task does not time out after 24 hrs as intended.

Question:

Is there a way to restrict the max-time of a task - like a meta-timeout?

Airflow version: 1.10.14

Walk-through:

import os
from datetime import timedelta

from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor
# InitCleanProcFolderOperator is a custom operator; its import is project-specific.

BASE_DIR = "/some/base/dir/"
FILE_NAME = "some_file.xlsx"
VOL_BASE_DIR = "/some/mounted/vol/"

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": "2020-11-01",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "supplier",
    default_args=default_args,
    description="ETL Process for Supplier",
    schedule_interval=None,
    catchup=False,
    max_active_runs=1,
)

file_sensor = FileSensor(
    task_id="file_sensor",
    poke_interval=60*60,
    timeout=24*60*60,
    retries=4,
    mode="reschedule",
    filepath=os.path.join(BASE_DIR, FILE_NAME),
    fs_conn_id='conn_filesensor',
    dag=dag,
)

clean_docker_vol = InitCleanProcFolderOperator(
    task_id="clean_docker_vol",
    folder=VOL_BASE_DIR,
    dag=dag,
)

....

This DAG should run and check if a file exists. If it exists, it should continue. Occasionally, the sensor task is rescheduled because the file is provided too late (or, say, because of connection errors). The MAX overall 'run-time' of the DAG should NOT exceed 24 hrs. Due to the retries, however, the time does exceed the 24-hr timeout if the task fails and is rescheduled.

Example:

  1. runs for 4 hrs (20 hrs should be left)
  2. fails
  3. up_for_retry
  4. starts again with a 24-hr timeout, not 20 hrs

As I need to allow retries, simply setting retries to 0 to avoid this behavior is not an option. I was rather looking for a meta-timeout variable in Airflow, a hint at how this could be implemented within the related classes, or any other workaround.

Many thanks.

You can use the poke_interval parameter to configure the poking frequency within the predefined timeout. Something like this: MySensor(..., retries=0, timeout=24*60*60, poke_interval=60*60). In this example the sensor will poke every hour, and if it does not succeed within a day, it will fail.
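A minimal sketch of that suggestion applied to the FileSensor from the question (paths and connection id reused from above; the concrete values are illustrative):

file_sensor = FileSensor(
    task_id="file_sensor",
    poke_interval=60 * 60,   # poke once per hour
    timeout=24 * 60 * 60,    # give up after 24 hrs in total
    retries=0,               # no retries, so the timeout is applied exactly once
    mode="reschedule",       # free the worker slot between pokes
    filepath=os.path.join(BASE_DIR, FILE_NAME),
    fs_conn_id="conn_filesensor",
    dag=dag,
)

Note the trade-off the question already describes: with retries=0, a transient connection error inside poke() fails the task immediately instead of being retried.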

I implemented a rather hacky solution that works for me.

  • Added a new function to the sensor class:
    # imports needed by the snippet (module level)
    import datetime
    import logging

    from airflow.exceptions import AirflowException, AirflowSkipException
    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.utils import timezone

    def _apply_meta_timeout(self, context):

        if not self.meta_task_timeout:
            return None
        elif self.meta_task_timeout and self.retries == 0:
            raise ValueError("'meta_task_timeout' cannot be applied if 'retries' is set to 0. Use 'timeout' instead.")

        # normalize a timedelta to seconds
        if isinstance(self.meta_task_timeout, datetime.timedelta):
            self.meta_task_timeout = self.meta_task_timeout.total_seconds()
        if not isinstance(self.meta_task_timeout, (int, float)):
            raise ValueError("Cannot convert 'meta_task_timeout' to type(int) or type(float).")

        if self.meta_task_timeout < self.timeout:
            raise ValueError("'meta_task_timeout' cannot be less than the 'timeout' variable.")

        logging.info(f"Get current dagrun params: {context['ti'].task_id}, {context['ti'].dag_id}, {context['ti'].execution_date}, {context['ti'].try_number}")
        pg_hook = PostgresHook(postgres_conn_id="airflow-metadata-db")
        pg_cur = pg_hook.get_cursor()
        if not context['ti'].try_number == 1:
            # On a retry, the task instance's start_date has been reset, so fetch
            # the start of the very first (failed) try from the metadata DB instead.
            try:
                query = f"""
                select start_date from task_fail
                    where task_id='{context['ti'].task_id}'
                    and dag_id='{context['ti'].dag_id}'
                    and execution_date='{context['ti'].execution_date}'
                    order by start_date asc
                    LIMIT 1;"""
                pg_cur.execute(query)
                init_start_timestamp = pg_cur.fetchone()[0]
            except Exception as e:
                raise ConnectionError("Connection failed with error: " + str(e))
            finally:
                pg_cur.close()
                pg_hook.get_conn().close()
        else:
            init_start_timestamp = context['ti'].start_date

        logging.info(f"Initial dag startup: {init_start_timestamp}")

        if (timezone.utcnow() - init_start_timestamp).total_seconds() > self.meta_task_timeout:
            if self.soft_fail:
                self._do_skip_downstream_tasks(context)
                raise AirflowSkipException('Snap. Maximal task runtime is UP.')
            raise AirflowException('Snap. Maximal task runtime is UP.')

        logging.info(f"Time left until 'meta_task_timeout' applies: {self.meta_task_timeout - (timezone.utcnow() - init_start_timestamp).total_seconds()} second(s).")
  • Overrode/extended the poke function:
    def poke(self, context):
        ...
        ...
        # check for meta-timeout
        self._apply_meta_timeout(context)
  • Added the Airflow metadata database connection as: airflow-metadata-db

  • Called the sensor operator with the additional params (see the sketch after this list for how they might be wired through the constructor):

    dummy_sensor = FileSensor(
        task_id="file_sensor",
        remote_path=os.path.join(REMOTE_INPUT_PATH, REMOTE_INPUT_FILE),
        do_xcom_push=False,
        timeout=60,
        retries=2,
        mode="reschedule",
        meta_task_timeout=5*60,
        soft_fail=True,
        #context=True,
    )
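For completeness, a minimal sketch of how the extra parameter might be wired through a custom sensor class (the class name MetaTimeoutFileSensor is illustrative, not from the original; the import paths match Airflow 1.10):

from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.utils.decorators import apply_defaults

class MetaTimeoutFileSensor(FileSensor):
    """FileSensor variant enforcing an overall runtime limit across retries."""

    @apply_defaults
    def __init__(self, meta_task_timeout=None, **kwargs):
        super().__init__(**kwargs)
        # overall limit in seconds (or a timedelta), spanning all retries
        self.meta_task_timeout = meta_task_timeout

    def poke(self, context):
        # enforce the overall limit before each regular poke
        self._apply_meta_timeout(context)
        return super().poke(context)

    # _apply_meta_timeout(...) as defined above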

The main issue necessitating this workaround is that Airflow seems to override the initial start_date on each individual try of the task.

Please feel free to add any suggestions for improvement. Thanks

This is why we use retries and retry_delay for sensors instead of poke_interval and timeout. Retries achieve exactly what you want. In your task definition, use

retries=24,
retry_delay=timedelta(hours=1),

instead of

poke_interval=60*60,
timeout=24*60*60,
retries=4,

where, by the way, you should add mode="reschedule", so that your sensor doesn't take up a worker slot for its whole execution time (here, your task occupies a whole slot for 24 hours while sleeping most of the time).
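Put together, the suggested configuration might look like the sketch below, reusing the FileSensor arguments from the question. Each try times out quickly so that the retry schedule controls both the hourly cadence and the ~24-hr overall cap; the short per-try poke_interval/timeout values are an assumption, not from the answer:

from datetime import timedelta

file_sensor = FileSensor(
    task_id="file_sensor",
    poke_interval=60,                # a single short poke per try
    timeout=120,                     # each individual try fails quickly ...
    retries=24,                      # ... and is retried up to 24 times
    retry_delay=timedelta(hours=1),  # one hour between tries ~ 24 hrs overall
    mode="reschedule",               # free the worker slot between tries
    filepath=os.path.join(BASE_DIR, FILE_NAME),
    fs_conn_id="conn_filesensor",
    dag=dag,
)

With this layout, a transient connection error simply consumes one of the 24 retries instead of restarting a 24-hr timeout.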

The start_date of each dag run shouldn't be overwritten by Airflow and should be available through {{ ds }} (which is the start of the data interval) or {{ data_interval_end }} (see the Airflow Documentation).
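Building on that, one way to approximate a meta-timeout without querying the task_fail table might be to compare against the DagRun's start_date from the task context, since that value is not reset when an individual task retries. A sketch, assuming it is called from inside poke(); the helper name _check_meta_timeout and the max_seconds parameter are hypothetical:

from airflow.exceptions import AirflowSkipException
from airflow.utils import timezone

def _check_meta_timeout(self, context, max_seconds=24 * 60 * 60):
    # DagRun.start_date marks the start of the whole run and survives task retries
    run_start = context['dag_run'].start_date
    if (timezone.utcnow() - run_start).total_seconds() > max_seconds:
        raise AirflowSkipException("Maximum overall runtime exceeded.")

This measures the whole DAG run rather than the sensor task alone, so it only matches the intent when the sensor sits at the start of the DAG.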
