Create a DAG with Airflow
I want to create a DAG that runs every hour. We have a cleanData
process, which imports and cleans files, followed by a storeData
process, which sends us a report twice a day, at 8 am and 6 pm.
How can I include these times in the DAG?
import pandas as pd
import numpy as np
from airflow import DAG
from datetime import datetime
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 12, 5),
    'retries': 1
}

def storeData(**context):
    df = context['task_instance'].xcom_pull(task_ids='clean_Data')
    print(df)

def cleanData(**context):
    data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
            'Age': [np.nan, 21, np.nan, 18]}
    df = pd.DataFrame(data)
    df = df.fillna(0)
    return df

dag = DAG(
    'CleaningPipelineDAG',
    default_args=default_args,
    description='Cleaning Data',
    schedule_interval='@once',
)

t1 = PythonOperator(
    task_id='clean_Data',
    provide_context=True,
    python_callable=cleanData,
    dag=dag,
)

t2 = PythonOperator(
    task_id='store_data',
    provide_context=True,
    python_callable=storeData,
    dag=dag,
)

t1 >> t2
If the whole DAG can run twice a day (the same schedule for every task, with no skipping), then you can use the following schedule_interval:
dag = DAG(
    'CleaningPipelineDAG',
    default_args=default_args,
    description='Cleaning Data',
    schedule_interval='0 8,18 * * *',
)
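To make the cron expression concrete: '0 8,18 * * *' matches minute 0 of hours 8 and 18 on every day. The sketch below uses only the standard library to list when that expression would match. It does not call Airflow, and the names REPORT_HOURS, next_fire_times and is_report_hour are hypothetical helpers introduced here for illustration, not part of the original question or answer.

```python
from datetime import datetime, timedelta

# Hours matched by the cron expression '0 8,18 * * *' (minute 0 of 08:00 and 18:00).
REPORT_HOURS = (8, 18)


def next_fire_times(after, count=4):
    """Return the next `count` matches of '0 8,18 * * *' strictly after `after`."""
    fire_times = []
    # Start at the top of the hour following `after`, then step hour by hour.
    candidate = after.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    while len(fire_times) < count:
        if candidate.hour in REPORT_HOURS:
            fire_times.append(candidate)
        candidate += timedelta(hours=1)
    return fire_times


def is_report_hour(ts):
    """Hypothetical gate for an hourly DAG: True only at 08:00 and 18:00."""
    return ts.hour in REPORT_HOURS and ts.minute == 0


if __name__ == "__main__":
    for t in next_fire_times(datetime(2020, 12, 5, 9, 30)):
        print(t.isoformat())
```

If the DAG has to stay hourly, as in the question, a predicate such as is_report_hour could serve as the python_callable of a ShortCircuitOperator placed between clean_Data and store_data, so that the report step is skipped on all other runs. This is an assumption about how you might wire it, not something the answer above prescribes. Note that Airflow's execution_date marks the start of the schedule interval rather than the moment the run starts, so check which timestamp you compare against for your Airflow version.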