简体   繁体   中英

airflow trigger_dag execution_date is the next day, why?

Recently I have tested airflow so much that have one problem with execution_date when running airflow trigger_dag <my-dag> .

I have learned that execution_date is not what we think at first time from here :

Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.

start_date = datetime.combine(datetime.today(),
                              datetime.min.time())

args = {
    "owner": "xigua",
    "start_date": start_date
}
dag = DAG(dag_id="hadoopprojects", default_args=args,
          schedule_interval=timedelta(days=1))


wait_5m = ops.TimeDeltaSensor(task_id="wait_5m",
                              dag=dag,
                              delta=timedelta(minutes=5))

Above codes is the start part of my daily workflow, the first task is a TimeDeltaSensor that waits another 5 minutes before actual work, so this means my dag will be triggered at 2016-09-09T00:05:00 , 2016-09-10T00:05:00 ... etc.

In Web UI, I can see something like scheduled__2016-09-20T00:00:00 , and task is run at 2016-09-21T00:00:00 , which seems reasonable according to ETL model.

However someday my dag is not triggered for unknown reason, so I trigger it manually, if I trigger it at 2016-09-20T00:10:00 , then the TimeDeltaSensor will wait until 2016-09-21T00:15:00 before run.

This is not what I want, I want it to run at 2016-09-20T00:15:00 not the next day, I have tried passing execution_date through --conf '{"execution_date": "2016-09-20"}' , but it doesn't work.

How should I deal with this issue ?

$ airflow version
[2016-09-21 17:26:33,654] {__init__.py:36} INFO - Using executor LocalExecutor
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
   v1.7.1.3

First, I recommend you use constants for start_date , because dynamic ones would act unpredictably based on with your airflow pipeline is evaluated by the scheduler.

More information about start_date here in an FAQ entry that I wrote and sort all this out: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date

Now, about execution_date and when it is triggered, this is a common gotcha for people onboarding on Airflow. Airflow sets execution_date based on the left bound of the schedule period it is covering, not based on when it fires (which would be the right bound of the period). When running an schedule='@hourly' task for instance, a task will fire every hour. The task that fires at 2pm will have an execution_date of 1pm because it assumes that you are processing the 1pm to 2pm time window at 2pm. Similarly, if you run a daily job, the run an with execution_date of 2016-01-01 would trigger soon after midnight on 2016-01-02 .

This left-bound labelling makes a lot of sense when thinking in terms of ETL and differential loads, but gets confusing when thinking in terms of a simple, cron-like scheduler.

Airflow will provide the time in UTC. I am not sure at what timezone you are running the tasks. So make sure you think of UTC timezone and schedule or trigger the jobs accordingly.

Try converting the time you want to trigger to UTC time and trigger the DAG. it works. For more information, you can read the below link

https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM