
airflow - creating dag and task dynamically to create the pipeline for one object

In Airflow I want to export some tables from PG to BQ.

task1: get the max id from BQ
task2: export the data from PG (id>maxid)
task3: GCS to BQ stage
task4: BQ stage to BQ main

But there is a small challenge: the schedule intervals differ. So I created a JSON file that specifies the sync interval for each table. If a table's interval is 2 minutes it should use the DAG upsert_2mins, otherwise the 10-minute DAG (upsert_10mins). I generate this dynamically with the code below.

JSON config file:

{
    "tbl1": ["update_timestamp", "2mins", "stg"],
    "tbl2": ["update_timestamp", "2mins", "stg"]
}

Code:

import json
from airflow import DAG
from datetime import datetime, timedelta
from airflow.utils.dates import days_ago
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from customoperator.custom_PostgresToGCSOperator import custom_PostgresToGCSOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator


table_list = ['tbl1','tbl2']

#DAG details
docs = """test"""
# Add args and Dag
default_args = {
    'owner': 'DBteam',
    'depends_on_past': False,
    'start_date': days_ago(1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
    }

 

with open('/home/airflow/gcs/dags/upsert_dag/config.json','r') as conf:
    config = json.loads(conf.read())

def get_max_ts(dag, tablename, **kwargs):
    # code to find the max record
    return records[0][0]

def pgexport(dag, tablename, **kwargs):
    # code to export the data from PG to GCS
    export_tables.execute(None)


def stg_bqimport(dag, tablename, **kwargs):
    # code to import from GCS to BQ
    bqload.execute(None)

def prd_merge(dag, tablename, **kwargs):
    # code to merge the BQ stage table into the main BQ table
    bqmerge.execute(context=kwargs)

for table_name in table_list:
    
    sync_interval = config[table_name][1]
    cron_time = ''
    if sync_interval == '2mins':
        cron_time = '*/2 * * * *'
    else:
        cron_time = '*/10 * * * *'
    
    dag = DAG(
        'upsert_every_{}'.format(sync_interval),
        default_args=default_args,
        description='Incremental load - Every 10mins',
        schedule_interval=cron_time,
        catchup=False,
        max_active_runs=1,
        doc_md=docs
    )
    
    max_ts = PythonOperator(
        task_id="get_maxts_{}".format(table_name),
        python_callable=get_max_ts,
        op_kwargs={'tablename':table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )
   
    export_gcs = PythonOperator(
        task_id='export_gcs_{}'.format(table_name),
        python_callable=pgexport,
        op_kwargs={'tablename':table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )

    stg_load = PythonOperator(
        task_id='stg_load_{}'.format(table_name),
        python_callable=stg_bqimport,
        op_kwargs={'tablename':table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )

    merge = PythonOperator(
        task_id='merge_{}'.format(table_name),
        python_callable=prd_merge,
        op_kwargs={'tablename':table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )
    
    globals()[sync_interval] = dag
    max_ts >> export_gcs >> stg_load >> merge

It does create the DAG, but the problem shows up in the web UI: I am not able to see the tasks for the last table, while it should show the tasks for both tables.

Your code is creating 2 DAGs, one per table, but overwriting the first one with the second.
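
Both tbl1 and tbl2 are configured with the same "2mins" interval, so every pass through the loop computes the same sync_interval and writes to the same globals() key; only the DAG built in the last iteration survives for the scheduler to pick up. A minimal, Airflow-free sketch of that collision, reusing the config from the question:

# Minimal sketch of the key collision behind the missing tasks.
# Both tables map to the same "2mins" interval, so the second
# assignment overwrites the first -- just like globals()[sync_interval] = dag.
config = {
    "tbl1": ["update_timestamp", "2mins", "stg"],
    "tbl2": ["update_timestamp", "2mins", "stg"],
}

registered = {}
for table_name in ["tbl1", "tbl2"]:
    sync_interval = config[table_name][1]               # "2mins" for both tables
    registered[sync_interval] = "DAG for {}".format(table_name)

print(registered)  # {'2mins': 'DAG for tbl2'} -- only the last table's DAG remains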

My suggestion is to change the format of the JSON file to:

{
    "2mins": {
                "tbl1": ["update_timestamp", "stg"],
                "tbl2": ["update_timestamp", "stg"]
             },
    "10mins": {
                "tbl3": ["update_timestamp", "stg"],
                "tbl4": ["update_timestamp", "stg"]
             }
}

And have your code iterate over the schedules and create the needed tasks for each table (you will need two loops):

# looping on the schedules to create two dags
for schedule, tables in config.items():

    cron_time = '*/10 * * * *'

    if schedule == '2mins':
        cron_time = '*/2 * * * *'

    dag_id = 'upsert_every_{}'.format(schedule)

    dag = DAG(
        dag_id,
        default_args=default_args,
        description='Incremental load - Every 10mins',
        schedule_interval=cron_time,
        catchup=False,
        max_active_runs=1,
        doc_md=docs
    )

    # Looping over the tables to create the tasks for
    # each table in the current schedule
    for table_name, table_config in tables.items():
        max_ts = PythonOperator(
            task_id="get_maxts_{}".format(table_name),
            python_callable=get_max_ts,
            op_kwargs={'tablename':table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )

        export_gcs = PythonOperator(
            task_id='export_gcs_{}'.format(table_name),
            python_callable=pgexport,
            op_kwargs={'tablename':table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )

        stg_load = PythonOperator(
            task_id='stg_load_{}'.format(table_name),
            python_callable=stg_bqimport,
            op_kwargs={'tablename':table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )

        merge = PythonOperator(
            task_id='merge_{}'.format(table_name),
            python_callable=prd_merge,
            op_kwargs={'tablename':table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )

        # Tasks for the same table will be chained
        max_ts >> export_gcs >> stg_load >> merge

    # DAG is created among the global objects
    globals()[dag_id] = dag
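
Registering each DAG under its dag_id in globals() is what makes the two module-level DAG objects visible to Airflow's DagBag, so you end up with exactly two DAGs (upsert_every_2mins and upsert_every_10mins), each containing the four tasks for every table of its schedule.

For reference, the get_max_ts placeholder from the question could look roughly like the sketch below. This is only an illustration, assuming a BigQuery connection named bigquery_default and a hypothetical stg dataset with an id column; it mirrors the records[0][0] pattern from the question.

def get_max_ts(dag, tablename, **kwargs):
    # Illustrative sketch only (not from the original post):
    # read the current max id for the table from the BQ stage dataset.
    hook = BigQueryHook(bigquery_conn_id='bigquery_default', use_legacy_sql=False)
    sql = 'SELECT MAX(id) FROM `stg.{}`'.format(tablename)
    records = hook.get_records(sql)      # list of row tuples
    return records[0][0]                 # pushed to XCom by the PythonOperator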
