Airflow backfills and new dag runs

I have a DAG which has had "DAG runs" since 1 Jan 2015 until today, scheduled every day. Tasks in the DAG are not "past dependent", meaning that during a backfill they can be executed in any date order.
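For reference, a minimal sketch of the kind of DAG described (the DAG id and task name are placeholders; depends_on_past=False is what allows runs to execute in any date order):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Hypothetical daily DAG scheduled since 1 Jan 2015; tasks are not
# past-dependent, so backfill runs may execute in any date order.
dag = DAG('my_daily_dag',  # placeholder DAG id
          schedule_interval='@daily',
          start_date=datetime(2015, 1, 1),
          default_args={'depends_on_past': False})

compute_day = DummyOperator(task_id='compute_day', dag=dag)  # placeholder task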

If I need to backfill a task in the DAG, I clear all of its task instances (from today back to the past) using the UI; all DAG runs then switch to the "running" state and the task starts backfilling from 1 Jan 2015 to today. The tasks are time consuming, so even when executed in parallel by multiple threads/workers the backfill takes a few days to finish.
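(The same clearing can also be done from the command line; a sketch using the Airflow 1.x CLI with a placeholder DAG id and dates:

airflow clear my_daily_dag -s 2015-01-01 -e 2018-05-17

)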

The problem is that new "DAG runs" for tomorrow, the day after tomorrow, etc. won't be added by the scheduler until the backfill is finished, so we fail to calculate the new days' data on time. Is there any way to prioritize the tasks for new days as they come, and continue backfilling once the tasks for the new day are finished?

PS: Backfilling can also be done using the "airflow backfill" CLI, but that approach has its own problems, so for now I'm interested in the backfilling technique described above.
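For completeness, that CLI approach would look something like this (DAG id and dates are placeholders):

airflow backfill my_daily_dag -s 2015-01-01 -e 2018-05-17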

Similar to the comment on your question, the way I worked around this when backfilling a large database was to have a DAG generator create three DAGs (two backfill and one ongoing) based on the connection_created_on and start_date values.

The ongoing DAG runs hourly and begins at midnight on the same day as the connection_created_on value. The two backfill DAGs then pull daily data starting from the first of the current month, and monthly data starting from the first month of the start_date. In this case, I knew we would always want to start on the first of the month and that up to a month of data was small enough to pull in one go, so I split the work into these three DAG types for expediency.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator


def create_dag(dag_id,
               schedule,
               db_conn_id,
               default_args,
               catchup=False,
               max_active_runs=3):

    dag = DAG(dag_id,
              default_args=default_args,
              schedule_interval=schedule,
              catchup=catchup,
              max_active_runs=max_active_runs
              )
    with dag:
        # Placeholder task; the real extract/load tasks using db_conn_id
        # would be defined here.
        kick_off_dag = DummyOperator(task_id='kick_off_dag')

    return dag

db_conn_id = 'my_first_db_conn'
connection_created_on = '2018-05-17T12:30:54.271Z'

hourly_id = '{}_to_redshift_hourly'.format(db_conn_id)
daily_id = '{}_to_redshift_daily_backfill'.format(db_conn_id)
monthly_id = '{}_to_redshift_monthly_backfill'.format(db_conn_id)

start_date = '2005-01-01T00:00:00.000Z'
start_date = datetime.strptime(start_date, '%Y-%m-%dT%H:%M:%S.%fZ')
# Snap the historical start date to the first of its month.
start_date = datetime(start_date.year, start_date.month, 1)

cco_datetime = datetime.strptime(connection_created_on, '%Y-%m-%dT%H:%M:%S.%fZ')
# The ongoing hourly DAG starts at midnight on the day the connection was created.
hourly_start_date = datetime(cco_datetime.year, cco_datetime.month, cco_datetime.day)
# The daily backfill covers the first of the current month up to yesterday.
daily_start_date = hourly_start_date - timedelta(days=(cco_datetime.day-1))
daily_end_date = hourly_start_date - timedelta(days=1)
# The monthly backfill covers everything from start_date (or, if no
# start_date is known, roughly the last year) up to where the daily
# backfill begins.
monthly_start_date = start_date if start_date else hourly_start_date - timedelta(days=365+cco_datetime.day-1)
monthly_end_date = daily_start_date

globals()[hourly_id] = create_dag(hourly_id,
                                  '@hourly',
                                  db_conn_id,
                                  {'start_date': hourly_start_date,
                                   'retries': 2,
                                   'retry_delay': timedelta(minutes=5),
                                   'email': [],
                                   'email_on_failure': True,
                                   'email_on_retry': False},
                                  catchup=True,
                                  max_active_runs=1)

globals()[daily_id] = create_dag(daily_id,
                                 '@daily',
                                 db_conn_id,
                                 {'start_date': daily_start_date,
                                  'end_date': daily_end_date,
                                  'retries': 2,
                                  'retry_delay': timedelta(minutes=5),
                                  'email': [],
                                  'email_on_failure': True,
                                  'email_on_retry': False},
                                 catchup=True)

globals()[monthly_id] = create_dag(monthly_id,
                                   '@monthly',
                                   db_conn_id,
                                   {'start_date': monthly_start_date,
                                    'end_date': monthly_end_date,
                                    'retries': 2,
                                    'retry_delay': timedelta(minutes=5),
                                    'email': [],
                                    'email_on_failure': True,
                                    'email_on_retry': False},
                                   catchup=True)
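With the sample values above, this generates DAGs named my_first_db_conn_to_redshift_hourly, my_first_db_conn_to_redshift_daily_backfill, and my_first_db_conn_to_redshift_monthly_backfill: the monthly DAG backfills from 2005-01-01 up to 2018-05-01, the daily DAG covers 2018-05-01 through 2018-05-16, and the hourly DAG runs continuously from midnight on 2018-05-17. Because the backfill DAGs are separate from the ongoing one, the scheduler keeps creating new DAG runs for current data while the backfills churn through history.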
