
Airflow DAGs activate, but with a lag

So I made an Apache Airflow system in Docker and so far it works perfectly well, with one problem that persists through all DAGs: they activate on the previous iteration, not the current one.

For example, if I make a DAG that activates every minute, at 15:08 it will activate the DAG run for 15:07. And if I make a DAG that activates every year, in 2023 it will activate the run for 2022, not the current year.

Is there any way to fix this? Or is it supposed to be that way, and I should just account for it?

Here is the code for two of my DAGs as an example:

from datetime import datetime
import logging

import requests
from dateutil.relativedelta import relativedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def test_print(ds, foo, **kwargs):
    # ds is the logical/execution date as a YYYY-MM-DD string
    start_date = str(ds)
    end_date = str((datetime.strptime(ds, '%Y-%m-%d') + relativedelta(years=1)).date())

    print('HOLIDAYS:')
    print('--------------')
    print('START DATE:' + start_date)
    print('END DATE:' + end_date)
    print('--------------')

    now = ds
    data2send = {'the_date_n_hour': now}

    r = requests.post("http://[BACKEND SERVER]:8199/do_work/", json=data2send)
    print(r.text)
    assert now in r.text

    task_logger = logging.getLogger('airflow.task')
    task_logger.warning(r.text)

    return 'ok'

dag = DAG('test_test', description='test DAG',
          schedule_interval='*/1 * * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

test_operator = PythonOperator(task_id='test_task',
                               python_callable=test_print,
                               dag=dag,
                               provide_context=True,
                               op_kwargs={'foo': 'bar'})

And here is the yearly DAG:
from __future__ import print_function

import datetime

import requests
import pandas as pd
import sqlalchemy
from dateutil.relativedelta import relativedelta

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'retries': 12,
    'retry_delay': datetime.timedelta(minutes=60),
}
   
dag = DAG(
    dag_id='dag_holidays',
    default_args=args,
    schedule_interval='0 12 1 1 *',
    start_date=datetime.datetime(2013, 1, 1), 
    catchup=True)

def get_holidays(ds, gtp_id, **kwargs):
    # ds is the logical/execution date as a YYYY-MM-DD string;
    # the request covers one year starting from it
    holi_start_date = str(ds)
    holi_end_date = str((datetime.datetime.strptime(ds, '%Y-%m-%d') + relativedelta(years=1)).date())

    print('HOLIDAYS:')
    print('--------------')
    print('GTP ID: {}'.format(str(gtp_id)))
    print('START DATE:' + holi_start_date)
    print('END DATE:' + holi_end_date)
    print('--------------')
    r = requests.post("http://[BACKEND SERVER]/load_holidays/",
                      data={'gtp_id': gtp_id, 'start_date': holi_start_date, 'end_date': holi_end_date})
    if 'Error' in r.text:
        raise Exception(r.text)
    return r.text

engine = sqlalchemy.create_engine('[SQL SERVER]')
query_string1 = "select gtp_id from gtps"
all_ids = list(pd.read_sql_query(query_string1, engine).gtp_id)


for gtp_id in all_ids:
    task = PythonOperator(
        task_id='holidays_' + str(gtp_id),
        python_callable=get_holidays,
        provide_context=True,
        op_kwargs={'gtp_id': gtp_id},
        dag=dag,
    )

Yes, it is supposed to work this way, and it can definitely be a bit confusing at first.

The reason for this behavior is that Airflow was built for a lot of ETL-type processing, and with that pattern you run your DAG on the data of the previous interval.

For example, when your data processing DAG runs every day at 3am, the data it processes is the data that was collected since 3am the previous day. This period is called the Data Interval in Airflow terms. The start of the data interval is the Logical Date (called execution date in earlier versions), which is what is incorporated into the Run ID; I think this is what you are seeing as the previous iteration. The end of the data interval is the Run After date, which is when the DAG will actually be scheduled to run.
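For your every-minute DAG, that means the run that fires at 15:08 covers the data interval 15:07 to 15:08 and carries the logical date 15:07. Here is a minimal sketch that prints all three values, assuming Airflow 2.2 or later (where data_interval_start, data_interval_end, and logical_date are available in the task context); the DAG id and task id are made up for illustration:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def show_schedule_dates(logical_date, data_interval_start, data_interval_end, **kwargs):
    # For a */1 * * * * schedule, the run that fires at 15:08 sees:
    #   logical_date        -> 15:07 (start of the data interval; the "previous iteration")
    #   data_interval_start -> 15:07
    #   data_interval_end   -> 15:08 (when the run actually fires)
    print(logical_date, data_interval_start, data_interval_end)

with DAG('show_schedule_dates',              # hypothetical dag_id, for illustration only
         schedule_interval='*/1 * * * *',
         start_date=datetime(2023, 1, 1),
         catchup=False) as dag:
    PythonOperator(task_id='show_dates', python_callable=show_schedule_dates)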

When you hover over the Next Run: field in the Airflow UI for a given DAG, you will see all of those dates and timestamps for the next run of that DAG.
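So if you want a task to work with the moment the run actually fires rather than the lagged logical date, one way to account for it is to read data_interval_end instead of ds. A sketch based on the question's first DAG (again assuming Airflow 2.2+; the backend URL and payload field are the question's own):

def test_print(data_interval_end, foo, **kwargs):
    # data_interval_end is when this run actually fires (e.g. 15:08),
    # while ds / logical_date would still be the previous minute (15:07).
    # Formatted here to mirror the YYYY-MM-DD string that ds provided.
    now = data_interval_end.strftime('%Y-%m-%d')
    data2send = {'the_date_n_hour': now}
    r = requests.post("http://[BACKEND SERVER]:8199/do_work/", json=data2send)
    print(r.text)
    return 'ok'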

This guide on scheduling DAGs might be a helpful reference, and it has some examples.

Disclaimer: I work for Astronomer, the company behind the guide I linked. :)
