
Scheduling Airflow DAGs to run exclusively Monday through Friday, i.e. only weekdays

I have a DAG that executes a Python script which takes a date argument (the current date). I'm scheduling the DAG to run at 6:00 AM, Monday through Friday (i.e. weekdays), Eastern Time. The DAG has to run the Python script on Monday with Monday's date as an argument, on Tuesday with Tuesday's date, and so on through Friday with Friday's date.

I noticed that a schedule interval of '0 6 * * 1-5' didn't work, because Friday's execution didn't occur until the following Monday.
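This is standard Airflow 1.x behavior: a run for a schedule interval is only triggered once the interval ends, i.e. at the next cron tick. A minimal stdlib sketch (using the week of this DAG's start_date, August 2020) of why the tick for Friday only fires on Monday:

```python
from datetime import datetime, timedelta

def cron_ticks(start, days):
    """Yield the ticks of '0 6 * * 1-5': 06:00 on Mon-Fri only."""
    d = start
    for _ in range(days):
        if d.weekday() < 5:  # 0=Mon .. 4=Fri
            yield d.replace(hour=6, minute=0)
        d += timedelta(days=1)

ticks = list(cron_ticks(datetime(2020, 8, 3), 8))  # Mon Aug 3 .. Mon Aug 10

# Airflow 1.x triggers the run whose execution_date is ticks[i]
# only when ticks[i+1] arrives (the end of that interval).
friday_run = ticks[4]    # execution_date: Fri 2020-08-07 06:00
triggered_at = ticks[5]  # actually triggered: Mon 2020-08-10 06:00
```

So the Friday run exists, but it is handed to the executor three days late, which is exactly the symptom described above.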

I changed the schedule interval to '0 6 * * *' to run every day at 6:00 AM, and at the start of my DAG I filter for dates that match '0 6 * * 1-5', so effectively Monday to Friday. For Saturday and Sunday, the downstream tasks should be skipped.

This is my code:

from __future__ import print_function
import pendulum
import logging
from airflow.models import DAG
from airflow.models import Variable
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.operators.python_operator import ShortCircuitOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule
from croniter import croniter


log = logging.getLogger(__name__)

def filter_processing_date(**context):
    execution_date = context['execution_date']
    cron = croniter('0 6 * * 1-5', execution_date)
    log.info('cron is: {}'.format(cron))
    log.info('execution date is: {}'.format(execution_date))
    #prev_date = cron.get_prev(datetime)
    #log.info('prev_date is: {}'.format(prev_date))
    return execution_date == cron.get_next(datetime).get_prev(datetime)


local_tz = pendulum.timezone("America/New_York")
# DAG parameters

default_args = {
    'owner': 'Managed Services',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 3, tzinfo=local_tz),
    'dagrun_timeout': None,
    'email': Variable.get('email'),
    'email_on_failure': True,
    'email_on_retry': False,
    'provide_context': True,
    'retries': 12,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'execute_python',
    schedule_interval='0 6 * * *',
    default_args=default_args
    ) as dag:

    start_dummy = DummyOperator(
        task_id='start',
        dag=dag
    )

    end_dummy = DummyOperator(
        task_id='end',
        trigger_rule=TriggerRule.NONE_FAILED,
        dag=dag
    )

    weekdays_only = ShortCircuitOperator(
        task_id='weekdays_only',
        python_callable=filter_processing_date,
        dag=dag
    )


    run_python = SSHOperator(
        ssh_conn_id="oci_connection",
        task_id='run_python',
        command='/usr/bin/python3 /home/sb/local/bin/runProcess.py -d {{ ds_nodash }}',
        dag=dag
    )


    start_dummy >> weekdays_only >> run_python >> end_dummy

Unfortunately, the weekdays_only task is failing with the error message below. What is going wrong?

(Screenshots of the Airflow error message and its continuation are omitted.)

Airflow version: v1.10.9-composer

Python 3
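Judging from the code alone (the error screenshots are unavailable), the likely culprit is the chained call in filter_processing_date: croniter's get_next(datetime) returns a plain datetime.datetime, not another croniter object, so calling .get_prev(...) on the result raises an AttributeError. A stdlib-only sketch of the type mismatch:

```python
from datetime import datetime

# cron.get_next(datetime) hands back an ordinary datetime.datetime.
# Plain datetimes have no croniter methods, so chaining .get_prev(...)
# onto the returned value fails with AttributeError.
returned = datetime(2020, 8, 7, 6, 0)  # stand-in for cron.get_next(datetime)
has_get_prev = hasattr(returned, "get_prev")  # False
```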

I managed to solve my problem by hacking something together: checking whether the next execution date is a weekday, and returning True if so, False otherwise. I call the function from a ShortCircuitOperator, which proceeds with the downstream tasks on True and skips them on False.

This is my code below, but I'm open to better solutions.

from __future__ import print_function
import pendulum
import logging
from airflow.models import DAG
from airflow.models import Variable
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.operators.python_operator import ShortCircuitOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule


log = logging.getLogger(__name__)


def checktheday(**context):
    next_execution_date = context['next_execution_date']
    log.info('next_execution_date is: {}'.format(next_execution_date))
    # weekday() returns 0 (Monday) through 6 (Sunday); 0-4 are weekdays.
    decision = next_execution_date.weekday() < 5
    log.info('decision is: {}'.format(decision))
    return decision


local_tz = pendulum.timezone("America/New_York")
# DAG parameters

default_args = {
    'owner': 'Managed Services',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 3, tzinfo=local_tz),
    'dagrun_timeout': None,
    'email': Variable.get('email'),
    'email_on_failure': True,
    'email_on_retry': False,
    'provide_context': True,
    'retries': 12,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'execute_python',
    schedule_interval='0 6 * * *',
    default_args=default_args
    ) as dag:

    start_dummy = DummyOperator(
        task_id='start',
        dag=dag
    )

    end_dummy = DummyOperator(
        task_id='end',
        trigger_rule=TriggerRule.NONE_FAILED,
        dag=dag
    )

    weekdays_only = ShortCircuitOperator(
        task_id='weekdays_only',
        python_callable=checktheday,
        dag=dag
    )


    run_python = SSHOperator(
        ssh_conn_id="oci_connection",
        task_id='run_python',
        command='/usr/bin/python3 /home/sb/local/bin/runProcess.py -d {{ macros.ds_format(macros.ds_add(ds, 1), "%Y-%m-%d", "%Y%m%d") }}',
        dag=dag
    )


    start_dummy >> weekdays_only >> run_python >> end_dummy
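One cleaner alternative (a sketch, not tested against this exact setup): keep schedule_interval='0 6 * * 1-5' and template the date with next_ds_nodash instead of ds_nodash. Because Airflow 1.x triggers a run at the end of its interval, next_execution_date is the moment the run actually fires, so every weekday run would receive that day's date and both the daily schedule and the ShortCircuitOperator become unnecessary. The tick arithmetic, in plain stdlib Python (the next_ds_nodash macro is assumed available in Airflow 1.10.9):

```python
from datetime import datetime

# Ticks of '0 6 * * 1-5' across one week: Mon Aug 3 .. Mon Aug 10, 2020.
ticks = [datetime(2020, 8, d, 6) for d in (3, 4, 5, 6, 7, 10)]

# The run for interval [ticks[i], ticks[i+1]) fires at ticks[i+1], and
# inside that run next_execution_date == ticks[i+1].  Templating with
# {{ next_ds_nodash }} therefore passes the date the run fires on.
fire_dates = [ticks[i + 1] for i in range(len(ticks) - 1)]
all_weekdays = all(d.weekday() < 5 for d in fire_dates)  # Mon-Fri only
```

With that schedule, the SSHOperator command would become something like '/usr/bin/python3 /home/sb/local/bin/runProcess.py -d {{ next_ds_nodash }}', though this rests on the interval-end trigger semantics of Airflow 1.x and should be verified on v1.10.9-composer before relying on it.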
