
Create a DAG with Airflow

I want to create a DAG that runs every hour. It has a cleanData process that imports and cleans files, followed by a storeData process that sends us a report twice a day, at 8 am and 6 pm. How can I include these times in the DAG?

import pandas as pd
import numpy as np

from airflow import DAG
from datetime import datetime
from airflow.operators.python_operator import PythonOperator


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 12, 5),
    'retries': 1
}


def storeData(**context):
    df = context['task_instance'].xcom_pull(task_ids='clean_Data')
    print(df)


def cleanData(**context):
    data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
            'Age': [np.nan, 21, np.nan, 18]}
    df = pd.DataFrame(data)
    df = df.fillna(0)
    return df


dag = DAG(
    'CleaningPipelineDAG',
    default_args=default_args,
    description='Cleaning Data',
    schedule_interval='@once',
)


t1 = PythonOperator(
    task_id='clean_Data',
    provide_context=True,
    python_callable=cleanData,
    dag=dag,
)

t2 = PythonOperator(
    task_id='store_data',
    provide_context=True,
    python_callable=storeData,
    dag=dag,
)

t1 >> t2

If the whole DAG can run twice a day (the same schedule for every task, with no skipping), you can use the following schedule_interval. The cron expression 0 8,18 * * * fires at 08:00 and 18:00 every day:

dag = DAG(
    'CleaningPipelineDAG',
    default_args=default_args,
    description='Cleaning Data',
    schedule_interval='0 8,18 * * *',
)
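
If the DAG has to stay hourly, as the question describes, one possible approach is to keep an hourly schedule_interval and place a gate in front of store_data that lets it run only at the report hours. Airflow's ShortCircuitOperator skips downstream tasks when its callable returns a falsy value, so it could serve as that gate. The following is only a sketch of such a callable in plain Python. The names REPORT_HOURS and should_send_report are hypothetical and not part of the original post, and the hours are assumed to be in the scheduler's timezone.

```python
from datetime import datetime

# Assumption: hours at which the report should go out (8 am and 6 pm),
# expressed in the same timezone as the run's execution date.
REPORT_HOURS = {8, 18}


def should_send_report(execution_date, report_hours=REPORT_HOURS):
    """Return True only for hourly runs whose hour is a report hour."""
    return execution_date.hour in report_hours


if __name__ == "__main__":
    # Check a few sample hourly runs on the DAG's start date.
    for hour in (7, 8, 12, 18):
        run = datetime(2020, 12, 5, hour)
        print(hour, should_send_report(run))
```

In a DAG, the gate's python_callable would read the run's date from the task context (for example context['execution_date']) and pass it to this function. Note that for an hourly schedule the execution date marks the start of the interval, so the hours may need shifting by one depending on when the report should actually be sent.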
