Python script scheduling in Airflow
Hi everyone,
I need to schedule my Python files (which contain data extraction from SQL and some joins) using Airflow. I have successfully installed Airflow on my Linux server, and the Airflow webserver is available to me. But even after going through the documentation, I am not clear where exactly I need to write the script for scheduling, or how that script becomes available to the Airflow webserver so I can see its status.
As far as the configuration is concerned, I know where the dag folder is located in my home directory and also where the example dags are located.
Note: Please don't mark this as a duplicate of How to run bash script file in Airflow, as I need to run Python files lying in a different location.
You should probably use the PythonOperator to call your function. If you want to define the function somewhere else, you can simply import it from a module, as long as it's accessible in your PYTHONPATH.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from my_script import my_python_function

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2017, 7, 7),
}

dag = DAG('tutorial', default_args=default_args)

PythonOperator(dag=dag,
               task_id='my_task_powered_by_python',
               provide_context=False,
               python_callable=my_python_function,
               op_args=['arguments_passed_to_callable'],
               op_kwargs={'keyword_argument': 'which will be passed to function'})
If your function my_python_function was in a script file /path/to/my/scripts/dir/my_script.py, then before starting Airflow you could add the path to your scripts to the PYTHONPATH like so:
export PYTHONPATH=/path/to/my/scripts/dir/:$PYTHONPATH
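For reference, a minimal my_script.py might look like the sketch below. The function name matches the import in the example above, but its signature and body are placeholders, standing in for your actual SQL extraction and joins:

```python
# my_script.py -- hypothetical module living in /path/to/my/scripts/dir/
# The signature lines up with the op_args/op_kwargs passed to the PythonOperator.
def my_python_function(positional_arg, keyword_argument=None):
    """Placeholder task body: replace with your SQL extraction and joins."""
    result = f"ran with {positional_arg!r} and {keyword_argument!r}"
    print(result)
    return result
```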
More information here: https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html
Default args and other considerations as in the tutorial: https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html
You can also use the BashOperator to execute Python scripts in Airflow. You can put your scripts in a folder inside the DAG folder. If your scripts are somewhere else, just give the path to those scripts.
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta

seven_days_ago = datetime.combine(datetime.today() - timedelta(7),
                                  datetime.min.time())

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': seven_days_ago,
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('simple', default_args=default_args)

t1 = BashOperator(
    task_id='testairflow',
    bash_command='python /home/airflow/airflow/dags/scripts/file1.py',
    dag=dag)
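Note that the BashOperator launches file1.py in a separate Python process, so the script must be runnable on its own, with its own imports and connections. A hypothetical sketch of such a standalone script, with placeholder logic:

```python
# file1.py -- hypothetical standalone script invoked by the BashOperator above.
# It runs in its own process, so it cannot rely on anything in the DAG file.
import sys

def main(argv):
    # Placeholder for the real work (SQL extraction, joins, ...).
    target = argv[1] if len(argv) > 1 else "default"
    print(f"processing {target}")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))
```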
Airflow parses all Python files in $AIRFLOW_HOME/dags (in your case /home/amit/airflow/dags). That Python script should return a DAG object, as shown in the answer from "postrational". When a DAG is reported as missing, it means there is some issue in the Python code and Airflow could not load it. Check the Airflow webserver or scheduler logs for more details, as stderr and stdout go there.
airflow webserver -p <port>
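One way to reproduce such a load failure outside Airflow is simply to import the DAG file by hand; any error raised is the same one the scheduler hits. A hypothetical helper for this (parses_cleanly is not an Airflow API, just an illustration), demonstrated on throwaway files:

```python
import importlib.util
import os
import tempfile

def parses_cleanly(path):
    """Return True if the Python file at `path` imports without raising."""
    spec = importlib.util.spec_from_file_location("dag_check", path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
        return True
    except Exception as err:
        print(f"{path}: {err}")  # roughly what the scheduler would log
        return False

# Demo on two temporary files: one valid, one with a deliberate syntax error.
good = tempfile.NamedTemporaryFile("w", suffix=".py", delete=False)
good.write("x = 1\n"); good.close()
bad = tempfile.NamedTemporaryFile("w", suffix=".py", delete=False)
bad.write("def broken(:\n"); bad.close()
good_ok = parses_cleanly(good.name)
bad_ok = parses_cleanly(bad.name)
os.unlink(good.name)
os.unlink(bad.name)
```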
Airflow loads the example DAGs automatically; this can be disabled in $HOME/airflow/airflow.cfg:
`load_examples = False`
Do some experiments and make changes in tutorial.py. If you are giving schedule_interval as cron syntax, then use a fixed 'start_date': datetime(2017, 7, 7) rather than 'start_date': datetime.now().

dag = DAG('tutorial', default_args=default_args, schedule_interval="@once")

or

dag = DAG('tutorial', default_args=default_args, schedule_interval="* * * * *")  # schedule each minute
Start the webserver:
$ airflow webserver -p <port>
Start the scheduler:
$ airflow scheduler