
Airflow - creating DAGs and tasks dynamically to build the pipeline for one object

In Airflow I want to export some tables from PG to BQ.

task1: get the max id from BQ
task2: export the data from PG (id>maxid)
task3: GCS to BQ stage
task4: BQ stage to BQ main
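
For reference, task4 boils down to a BigQuery MERGE from the stage table into the main table. A minimal sketch of that step (the project, dataset, and key-column names are placeholders I made up, not taken from the post; it assumes a dag object is already in scope):

from airflow.contrib.operators.bigquery_operator import BigQueryOperator

# Placeholder project/dataset/table and key column - adjust to your schema.
MERGE_SQL = """
MERGE `myproject.main.tbl1` AS m
USING `myproject.stg.tbl1` AS s
ON m.id = s.id
WHEN MATCHED THEN
  UPDATE SET update_timestamp = s.update_timestamp
WHEN NOT MATCHED THEN
  INSERT ROW
"""

merge_stage_to_main = BigQueryOperator(
    task_id='merge_tbl1',
    sql=MERGE_SQL,           # standard-SQL MERGE statement
    use_legacy_sql=False,
    dag=dag                  # assumes a DAG object is already defined
)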

But there is a slight challenge: the schedule interval differs per table, so I created a JSON file to specify the sync interval. If a table's interval is 2 mins it should go into the DAG upsert_2mins, otherwise into the 10-min DAG (upsert_10mins). I used the following code to generate the DAGs dynamically.

JSON config file:

{
    "tbl1": ["update_timestamp", "2mins", "stg"],
    "tbl2": ["update_timestamp", "2mins", "stg"]
}
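
(The post doesn't spell out the fields; they appear to be [incremental_column, sync_interval, target], and the code below only reads the interval:)

# Assumed field layout - only index 1 is used by the DAG code below.
sync_interval = config['tbl1'][1]   # -> '2mins'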

Code:

import json
from airflow import DAG
from datetime import datetime, timedelta
from airflow.utils.dates import days_ago
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from customoperator.custom_PostgresToGCSOperator import custom_PostgresToGCSOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator


table_list = ['tbl1','tbl2']

#DAG details
docs = """test"""
# Add args and Dag
default_args = {
    'owner': 'DBteam',
    'depends_on_past': False,
    'start_date': days_ago(1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
    }

 

with open('/home/airflow/gcs/dags/upsert_dag/config.json','r') as conf:
    config = json.loads(conf.read())

def get_max_ts(dag, tablename, **kwargs):
    # code to find the max record in BQ
    return records[0][0]

def pgexport(dag, tablename, **kwargs):
    # code to export the data from PG to GCS
    export_tables.execute(None)


def stg_bqimport(dag, tablename, **kwargs):
    # code to import GCS to BQ
    bqload.execute(None)

def prd_merge(dag, tablename, **kwargs):
    # code to merge the BQ stage table into the main BQ table
    bqmerge.execute(context=kwargs)

for table_name in table_list:

    sync_interval = config[table_name][1]
    cron_time = ''
    if sync_interval == '2mins':
        cron_time = '*/2 * * * *'   # every 2 minutes
    else:
        cron_time = '*/10 * * * *'

    dag = DAG(
        'upsert_every_{}'.format(sync_interval),
        default_args=default_args,
        description='Incremental load - Every {}'.format(sync_interval),
        schedule_interval=cron_time,
        catchup=False,
        max_active_runs=1,
        doc_md=docs
    )
    
    max_ts = PythonOperator(
        task_id="get_maxts_{}".format(table_name),
        python_callable=get_max_ts,
        op_kwargs={'tablename':table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )
   
    export_gcs = PythonOperator(
        task_id='export_gcs_{}'.format(table_name),
        python_callable=pgexport,
        op_kwargs={'tablename':table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )

    stg_load = PythonOperator(
        task_id='stg_load_{}'.format(table_name),
        python_callable=stg_bqimport,
        op_kwargs={'tablename':table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )

    merge = PythonOperator(
        task_id='merge_{}'.format(table_name),
        python_callable=prd_merge,
        op_kwargs={'tablename':table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )
    
    globals()[sync_interval] = dag
    max_ts >> export_gcs >> stg_load >> merge
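
The four callables above are stubs; for illustration only, get_max_ts might look something like this with the BigQueryHook imported earlier (the dataset name and MAX column are my assumptions, not the actual code):

def get_max_ts(dag, tablename, **kwargs):
    # Illustrative sketch - dataset and column names are assumptions.
    hook = BigQueryHook(use_legacy_sql=False)
    records = hook.get_records(
        'SELECT MAX(id) FROM `myproject.main.{}`'.format(tablename)
    )
    return records[0][0]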

The loop above actually created the DAGs, but the issue is that in the web UI I can only see the tasks for the last table, when it should show the tasks for both tables.

Your code is creating 2 dags, one for each table, but overwriting the first one with the second.

My suggestion is to change the format of the JSON file to:

{
    "2mins": {
                "tbl1": ["update_timestamp", "stg"],
                "tbl2": ["update_timestamp", "stg"]
             },
    "10mins": {
                "tbl3": ["update_timestamp", "stg"],
                "tbl4": ["update_timestamp", "stg"]
             }
}
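
With that layout, json.load returns a dict keyed by schedule, so each entry unpacks directly (a quick check, reusing the path from the question):

with open('/home/airflow/gcs/dags/upsert_dag/config.json') as conf:
    config = json.load(conf)

# config.items() now yields pairs like:
# ('2mins', {'tbl1': ['update_timestamp', 'stg'], 'tbl2': ['update_timestamp', 'stg']})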

And have your code iterate over the schedules and create the needed tasks for each table (you will need two loops):

# looping on the schedules to create two dags
for schedule, tables in config.items():

    cron_time = '*/10 * * * *'

    if schedule == '2mins':
        cron_time = '*/2 * * * *'   # every 2 minutes

    dag_id = 'upsert_every_{}'.format(schedule)

    dag = DAG(
        dag_id,
        default_args=default_args,
        description='Incremental load - Every {}'.format(schedule),
        schedule_interval=cron_time,
        catchup=False,
        max_active_runs=1,
        doc_md=docs
    )

    # Looping over the tables to create the tasks for
    # each table in the current schedule
    for table_name, table_config in tables.items():
        max_ts = PythonOperator(
            task_id="get_maxts_{}".format(table_name),
            python_callable=get_max_ts,
            op_kwargs={'tablename':table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )

        export_gcs = PythonOperator(
            task_id='export_gcs_{}'.format(table_name),
            python_callable=pgexport,
            op_kwargs={'tablename':table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )

        stg_load = PythonOperator(
            task_id='stg_load_{}'.format(table_name),
            python_callable=stg_bqimport,
            op_kwargs={'tablename':table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )

        merge = PythonOperator(
            task_id='merge_{}'.format(table_name),
            python_callable=prd_merge,
            op_kwargs={'tablename':table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )

        # Tasks for the same table will be chained
        max_ts >> export_gcs >> stg_load >> merge

    # DAG is created among the global objects so Airflow picks it up
    globals()[dag_id] = dag
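
As a side note (my suggestion, not part of the answer above), the globals() registration can be made a bit tidier by wrapping DAG creation in a factory function, a common pattern for dynamic DAGs:

def create_upsert_dag(schedule, tables, cron_time):
    # Build one DAG plus its per-table task chains and return it.
    dag = DAG(
        'upsert_every_{}'.format(schedule),
        default_args=default_args,
        schedule_interval=cron_time,
        catchup=False,
        max_active_runs=1
    )
    with dag:
        for table_name, table_config in tables.items():
            # create and chain the four PythonOperators here,
            # exactly as in the inner loop above
            pass
    return dag

for schedule, tables in config.items():
    cron = '*/2 * * * *' if schedule == '2mins' else '*/10 * * * *'
    globals()['upsert_every_{}'.format(schedule)] = create_upsert_dag(schedule, tables, cron)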
