
Airflow DAG Triggering

We have recently tried to adopt Airflow as our "data workflow" engine, and while I have figured most things out, I am still in a grey area about how the scheduler calculates when to trigger DAGs.

Take a look at this simple DAG:

from airflow import DAG
from datetime import datetime
from airflow.operators.bash_operator import BashOperator

dag_options = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime.now(),
}

with DAG('test_dag1', schedule_interval="5 * * * *", default_args=dag_options) as dag:
    task1 = BashOperator(
        task_id='task1',
        bash_command='date',
        dag=dag)

The scheduler will pick this up, but will not execute it. Now if I change the "start_date" to:

datetime(year=xxxx, month=yyyy, day=zzzz)

where xxxx, yyyy, zzzz are today's date, it will start executing. Is this because the scheduler keeps re-reading the DAG from the source DAG folder, executing datetime.now() each time, noticing that the start date differs from the one currently queued, re-adding the DAG, and therefore pushing the execution date forward (my dag_dir_list_interval is 300)?

Also, as I understand it, when a DAG is un-paused (or added with dags_are_paused_at_creation = False), the scheduler will schedule executions as follows:

  • 1st DAG execution: the instant after (start_date + interval)
  • 2nd DAG execution: the instant after (start_date + (interval * 2))
  • 3rd DAG execution: the instant after (start_date + (interval * 3))

Is this a correct assumption?
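The rule assumed above can be sketched in plain Python, without Airflow. This is a minimal illustration under two assumptions: the interval is a fixed timedelta (a cron expression would need its own handling), and each run is stamped with the start of its interval and becomes eligible only once that interval has elapsed. The function name `expected_runs` and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def expected_runs(start_date, interval, count):
    """Return (execution_date, earliest_trigger) pairs for the first `count` runs.

    Assumes a fixed-length interval: run n covers the interval beginning at
    start_date + n * interval and is eligible only after that interval ends.
    """
    runs = []
    for n in range(count):
        execution_date = start_date + n * interval
        earliest_trigger = execution_date + interval
        runs.append((execution_date, earliest_trigger))
    return runs


# Hypothetical values, chosen only for illustration.
start = datetime(2017, 1, 1, 0, 0)
for execution_date, trigger in expected_runs(start, timedelta(hours=1), 3):
    print(execution_date, "->", trigger)
```

Under these assumptions the earliest trigger times are start_date + interval, start_date + 2 * interval, and so on, which matches the list above.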

UPDATE (7/30/2017)

Based on the assumption above, I created this DAG today (07/30/2017):

from airflow import DAG
from datetime import datetime
from airflow.operators.bash_operator import BashOperator

dag_options = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(year=2017, month=7, day=30, hour=20, minute=10),
}

with DAG('test_dag_100', schedule_interval="*/10 * * * *",
         default_args=dag_options) as dag:
    task1 = BashOperator(
        task_id='task_100',
        bash_command='date',
        dag=dag)

which should run at (UTC):

  • 7/30/2017 20:20:00
  • 7/30/2017 20:30:00
  • 7/30/2017 20:40:00

Unfortunately, this is not happening. Here are some screenshots of my dashboard:

Can someone explain why the DAG did not execute at 20:21:00? After 20:31:00 it still had not executed. What am I missing here?

By the way, I also noticed that, for some reason, every time I kick off a DAG manually through the dashboard, it just sits in the "running" state. Why is this? Does kicking it off manually have anything to do with any of the start timing options (start_date, interval, etc.)?

Thank you for any clarifications you can provide.

Your assumptions are correct. Airflow schedules the first DAG run after the specified schedule interval has elapsed from the start date. Using datetime.now() as the start date will result in Airflow rarely, if ever, triggering a DAG. This is mentioned in the scheduling docs.
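To see why a moving start date rarely becomes due, here is a minimal simulation in plain Python. It does not use Airflow; `is_due` is a hypothetical helper, the one-hour interval and 30-minute re-parse cadence are assumptions, and the real scheduler logic is more involved. The point is only that a start date re-evaluated on every parse keeps "start_date + interval" in the future.

```python
from datetime import datetime, timedelta


def is_due(start_date, interval, now):
    """A first run is due only once a full interval has elapsed since start_date."""
    return start_date + interval <= now


interval = timedelta(hours=1)            # hypothetical schedule interval
clock = datetime(2017, 7, 30, 20, 0)     # simulated wall clock
fixed_start = datetime(2017, 7, 30, 20, 0)

results = []
for _ in range(4):
    # Each time the DAG file is re-parsed, datetime.now() would return the current time.
    dynamic_start = clock
    results.append((is_due(fixed_start, interval, clock),
                    is_due(dynamic_start, interval, clock)))
    clock += timedelta(minutes=30)       # time passes before the next parse

# Each tuple is (fixed start date is due, moving start date is due).
print(results)
```

In this sketch the fixed start date becomes due after one hour, while the moving start date never does, because it advances together with the clock.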

If you were to specify a fixed start date, such as datetime(2017,7,27,1,0), with a schedule interval of "5 * * * *", then the first schedule point is 1:05 am on 7/27, and, following the rule above, that run is triggered once its interval has elapsed, at about 2:05 am. Note that this cron expression fires at minute 5 of every hour, not every five minutes (that would be "*/5 * * * *"), so the DAG then continues to run once an hour.
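For reference, the cron expression "5 * * * *" matches minute 5 of every hour, so its schedule points are one hour apart. The sketch below lists the first few schedule points after the start date from the example, using only the standard library. `next_at_minute` is a hypothetical helper that handles just this "fixed minute, every hour" pattern, not general cron syntax.

```python
from datetime import datetime, timedelta


def next_at_minute(after, minute):
    """Return the first time strictly after `after` whose minute field equals `minute`."""
    candidate = after.replace(minute=minute, second=0, microsecond=0)
    if candidate <= after:
        candidate += timedelta(hours=1)
    return candidate


start_date = datetime(2017, 7, 27, 1, 0)

points = []
current = start_date
for _ in range(3):
    current = next_at_minute(current, 5)
    points.append(current)

for point in points:
    print(point)
```

The printed schedule points fall at five minutes past consecutive hours, starting with 1:05 am on 7/27.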
