
Python script scheduling in Airflow

Hi everyone,

I need to schedule my Python files (which contain data extraction from SQL and some joins) using Airflow. I have successfully installed Airflow on my Linux server, and the Airflow webserver is available to me. But even after going through the documentation, I am not clear where exactly I need to write the script for scheduling, and how that script will become available to the Airflow webserver so that I can see its status.

As far as the configuration is concerned, I know where the dags folder is located in my home directory, and also where the example DAGs are located.

Note: Please don't mark this as a duplicate of How to run bash script file in Airflow, as I need to run Python files lying in some different location.

The configuration in the Airflow webserver is as follows:

[screenshot: Airflow webserver configuration]

Below is a screenshot of the dags folder in the AIRFLOW_HOME directory:

[screenshot: dags folder in AIRFLOW_HOME]

Also see the screenshots below of the DAG creation and of the Missing DAG error:

[screenshot: DAG creation]

[screenshot: Missing DAG error]

After I select the 'simple' DAG, the following missing-DAG error appears:

[screenshot: missing DAG error after selecting the 'simple' DAG]

You should probably use the PythonOperator to call your function. If you want to define the function somewhere else, you can simply import it from a module, as long as that module is accessible on your PYTHONPATH.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

from my_script import my_python_function

# Minimal default_args; the tutorial linked below covers the full set
default_args = {'start_date': datetime(2017, 7, 7)}

dag = DAG('tutorial', default_args=default_args)

PythonOperator(dag=dag,
               task_id='my_task_powered_by_python',
               provide_context=False,
               python_callable=my_python_function,
               op_args=['arguments_passed_to_callable'],
               op_kwargs={'keyword_argument': 'which will be passed to function'})
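With op_args and op_kwargs set as above, the task effectively calls my_python_function('arguments_passed_to_callable', keyword_argument='which will be passed to function') at run time.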

If your function my_python_function lives in a script file /path/to/my/scripts/dir/my_script.py,

then before starting Airflow you can add that directory to your PYTHONPATH, like so:

export PYTHONPATH=/path/to/my/scripts/dir/:$PYTHONPATH
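
For illustration, a minimal my_script.py might look like this (the function body below is a hypothetical placeholder, not part of the original answer):

# /path/to/my/scripts/dir/my_script.py -- hypothetical example module
def my_python_function(positional_arg, keyword_argument=None):
    """Invoked by the PythonOperator task defined above."""
    print('positional: %s, keyword: %s' % (positional_arg, keyword_argument))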

More information here: https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html

Default args and other considerations are covered in the tutorial: https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html

You can also use the BashOperator to execute Python scripts in Airflow. You can put your scripts in a folder inside the DAG folder. If your scripts live somewhere else, just give the path to those scripts.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    seven_days_ago = datetime.combine(datetime.today() - timedelta(7),
                                      datetime.min.time())

    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': seven_days_ago,
        'email': ['airflow@airflow.com'],
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    dag = DAG('simple', default_args=default_args)

    # Run an external Python script through the shell
    t1 = BashOperator(
        task_id='testairflow',
        bash_command='python /home/airflow/airflow/dags/scripts/file1.py',
        dag=dag)
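
Depending on your Airflow version, you can also smoke-test a single task from the command line, which runs it once without involving the scheduler (the date below is an assumed example):

airflow test simple testairflow 2017-07-07          # Airflow 1.x
airflow tasks test simple testairflow 2017-07-07    # Airflow 2.x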

Airflow parses all Python files in $AIRFLOW_HOME/dags (in your case /home/amit/airflow/dags). Each of those Python files should produce a DAG object at module level, as shown in the answer from "postrational". When a DAG is reported as missing, it means there is some issue in the Python code and Airflow could not load it. Check the Airflow webserver or scheduler logs for more details, as stderr and stdout go there.
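
A quick way to surface the underlying error is to run the DAG file directly with the same Python interpreter that Airflow uses (the file name below is an assumed example; substitute your own DAG file):

python /home/amit/airflow/dags/simple.py

Any traceback this prints is the reason Airflow could not load the DAG.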

  1. Install Airflow using the official Airflow documentation. It is a good idea to install it in a Python virtual environment: http://python-guide-pt-br.readthedocs.io/en/latest/dev/virtualenvs/
  2. When we start Airflow for the first time with

airflow webserver -p <port>

it loads the example DAGs automatically. This can be disabled in $HOME/airflow/airflow.cfg by setting

`load_examples = False`
  3. Create a dags folder in $HOME/airflow/ and put the tutorial.py file from https://airflow.incubator.apache.org/tutorial.html into it.
  4. Do some experiments and make changes in tutorial.py. If you give schedule_interval in cron syntax, use a fixed 'start_date': datetime(2017, 7, 7) rather than a dynamic 'start_date': datetime.now(). For example (see the sketch after this list):

     dag = DAG('tutorial', default_args=default_args, schedule_interval="@once")        # run once
     dag = DAG('tutorial', default_args=default_args, schedule_interval="* * * * *")    # schedule each minute

  5. Start the webserver: $ airflow webserver -p <port>

  6. Start the scheduler: $ airflow scheduler
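
Putting steps 3 and 4 together, a minimal dags/tutorial.py sketch might look like this (modelled on the linked tutorial; the print_date task is an assumed example, not from the original answer):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2017, 7, 7),  # fixed start_date, suitable for cron schedules
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# "* * * * *" schedules each minute; use "@once" to run a single time
dag = DAG('tutorial', default_args=default_args, schedule_interval='* * * * *')

# assumed example task: print the current date
t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)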
