
How to automate a whole Python script to run via Airflow?

I have a Python script that points to an elasticsearch cluster, performs aggregations and computations on the data, and then stores the insights in a local PostgreSQL database. The script can be run daily or weekly depending on user preference, as follows:

python script.py --approach daily 
python script.py --approach weekly

I want to automate this workflow so that it runs every 10 minutes via Airflow.

My guess is to use the BashOperator: a task t1 = BashOperator that executes the bash command python script.py --approach daily in a DAG, and a task t2 = BashOperator that executes python script.py --approach weekly.

The code didn't give the expected result: the Airflow web UI shows all the jobs stuck in the scheduled state.

Can anyone tell me what I have been doing wrong?


#imports
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.email_operator import EmailOperator

from datetime import datetime, timedelta

seven_days_ago = datetime.combine(datetime.today() - timedelta(7),
                                  datetime.min.time())

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': seven_days_ago,
    'email': ['me@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'max_tries': 3,
    'retry_delay': timedelta(minutes=10)
}

etl_dag = DAG('tester', default_args=default_args, schedule_interval='@once')

#the bashoperator to execute the bash command as to automate the task execution every 5 min
weekly_task = BashOperator(
    task_id='testing',
    bash_command='python my_script.py --approach weekly',
    dag=etl_dag)

You have several approaches here:

  1. Write two DAGs: one for daily and one for weekly.
  2. Write one DAG using the BranchDayOfWeekOperator, which branches your workflow depending on the day of the week. For example: on Monday it will perform --approach weekly and on all other days it will perform --approach daily.

If you are running Airflow >= 2.1.0 (not yet released):

from airflow.models import DAG
from airflow.operators.weekday import BranchDayOfWeekOperator
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='tester',
    default_args=default_args,
    schedule_interval='@daily'
) as dag:

    branch_op = BranchDayOfWeekOperator(
        task_id="branch_task",
        follow_task_ids_if_true="weekly_task",
        follow_task_ids_if_false="daily_task",
        week_day="MONDAY", # Replace with the day you want to execute the approch weekly
        use_task_execution_day=False, # Set true if you want the day to be checked against execution_date
    )
    weekly_op = BashOperator(
        task_id='weekly_task',
        bash_command='python  my_script.py --approach weekly',
    )
    daily_op = BashOperator(
        task_id='daily_task',
        bash_command='python  my_script.py --approach daily',
    )

    branch_op >> [weekly_op, daily_op]

If you are running Airflow < 2.1.0:

Copy the BranchDayOfWeekOperator code into your project, import it locally, and use the same code as above. The BranchDayOfWeekOperator is new in the Airflow 2.1 release. Note that you might need to change a few imports in the operator depending on which Airflow version you are running.
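
For illustration, the import swap might look like the sketch below; the local module path operators/weekday.py inside your DAGs folder is an assumption, not a requirement:

# Assumption: the BranchDayOfWeekOperator source was copied into
# <dags_folder>/operators/weekday.py and its internal imports were
# adjusted for the Airflow version you are running.
from operators.weekday import BranchDayOfWeekOperator  # local copy instead of airflow.operators.weekday
from airflow.operators.bash_operator import BashOperator  # pre-Airflow-2 import path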

Based on your syntax, I'm going to assume you are running Airflow 1.10.14.

I think you are trying to accomplish the following:

  • run my_script.py daily
  • run my_script.py weekly

You will need to alter your script so that it performs exactly one workflow cycle per invocation. This is because Airflow will execute your script on each schedule interval. If your script has an internal scheduling process (assuming that's what --approach determines), it will conflict with Airflow's core purpose, which is to manage your workflows. Each time Airflow runs the script, it would start another process that runs weekly (for --approach weekly), multiplying the workflows.

You want Airflow to schedule your workflow, not the script.
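
For illustration only, here is a minimal sketch of what my_script.py could look like once it performs exactly one cycle per invocation; the function names run_daily and run_weekly are assumptions about your script, not part of it:

# my_script.py -- hypothetical sketch: one aggregation cycle per invocation,
# no internal sleep/scheduling loop. Airflow decides when to call it.
import argparse

def run_daily():
    pass  # pull from Elasticsearch, aggregate, write insights to PostgreSQL

def run_weekly():
    pass  # same pipeline, but over a weekly aggregation window

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--approach', choices=['daily', 'weekly'], required=True)
    args = parser.parse_args()
    if args.approach == 'daily':
        run_daily()
    else:
        run_weekly()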


Two DAGs with separate schedule_interval will work here.

daily_dag.py

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

from datetime import datetime

dag = DAG(
    dag_id='daily_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval='0 7 * * *'
)
with dag:
    run_script = BashOperator(
        task_id='run_script',
        bash_command='python my_script.py',
    )

This DAG will run the workflow every day at 7 UTC.

weekly_run.py

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

from datetime import datetime

dag = DAG(
    dag_id='weekly_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval='0 7 * * Mon'
)
with dag:
    run_script = BashOperator(
        task_id='run_script',
        bash_command='python my_script.py',
    )

This DAG will run the workflow every Monday at 7 UTC.

Keeping them separate gives you cleaner pipelines and a clearer definition of what each pipeline does at its own frequency.

I would also look into converting those BashOperators into PythonOperators to remove a layer of abstraction, executing the Python code directly instead of through a shell.
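
For example, here is a hedged sketch of the daily DAG rewritten with a PythonOperator, assuming my_script.py exposes a callable such as run_daily (that name is an assumption; adjust it to whatever your script actually provides):

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path

from datetime import datetime

from my_script import run_daily  # assumption: my_script.py exposes this function

dag = DAG(
    dag_id='daily_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval='0 7 * * *'
)
with dag:
    run_script = PythonOperator(
        task_id='run_script',
        python_callable=run_daily,  # runs the Python code in-process, no shell layer
    )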
