
Is there any way to store the return value of a task in a Python variable and share it with downstream tasks (without using XCom or Airflow Variables)?

I am writing an Airflow DAG that will read a bunch of configs from the database and then execute a series of Python scripts using the BashOperator. The previously read configs will be passed as arguments.

The problem is that I cannot find an efficient way to share the config with the downstream operators. I designed the DAG below; these are my concerns.

  1. I am not sure how many DB calls will be made to fetch the values required inside the Jinja templates (in the example below).

  2. Besides, as the config is the same in every task, I am not sure it is a good idea to fetch it from the database every time. That's why I don't want to use XCom either. I used an Airflow Variable because the JSON parsing can happen in a single line, but I guess the database call issue is still there.

# Imports assume Airflow 2.x with the MySQL provider installed
from airflow.models import Variable
from airflow.providers.mysql.hooks.mysql import MySqlHook
from airflow.providers.mysql.operators.mysql import MySqlOperator


class ReturningMySqlOperator(MySqlOperator):
    def execute(self, context):
        hook = MySqlHook(mysql_conn_id=self.mysql_conn_id,
                         schema=self.database)
        # Read the config rows, index them by laptop name,
        # and store the result as JSON in an Airflow Variable
        s = hook.get_pandas_df(sql=self.sql, parameters=self.parameters)
        s = s.set_index('laptopName', drop=False)
        print(s)
        s = s.to_json(orient='index')
        Variable.set('jobconfig', s)



t1 = ReturningMySqlOperator(
    task_id='mysql_query',
    sql='SELECT * FROM laptops',
    mysql_conn_id='mysql_db_temp',
    dag=dag)



t3 = BashOperator(
    task_id='sequence_one',
    bash_command='python3 path/sequence1.py {{var.json.jobconfig.Legion.laptopName}} {{var.json.jobconfig.Legion.company}}',
    dag=dag)

t4 = BashOperator(
    task_id='sequence_two',
    bash_command='python3 path/sequence2.py {{var.json.jobconfig.Legion.laptopName}} {{var.json.jobconfig.Legion.company}}',
    dag=dag)

t5 = BashOperator(
    task_id='sequence_three',
    bash_command='python3 path/sequence3.py {{var.json.jobconfig.Legion.laptopName}} {{var.json.jobconfig.Legion.company}}',
    dag=dag)

t6 = BashOperator(
    task_id='sequence_four',
    bash_command='python3 path/sequence4.py {{var.json.jobconfig.Legion.laptopName}} {{var.json.jobconfig.Legion.company}}',
    dag=dag)

t1 >> t3 
t3 >> [t4,t6]

First point:

I am not sure how many DB calls will be made to fetch the values required inside the Jinja templates (in the example below).

In the example you provided, you are making two connections to the metadata DB in each sequence_x task, one per {{var.json.jobconfig.xx}} call. The good news is that those are not executed by the scheduler, so they are not made on every heartbeat interval. From the Astronomer guide:

Since all top-level code in DAG files is interpreted every scheduler "heartbeat," macros and templating allow run-time tasks to be offloaded to the executor instead of the scheduler.
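In other words, a top-level Variable.get(...) call in the DAG file runs on every parse, while a {{ var.json... }} reference is only rendered when the task instance executes on the worker. A minimal sketch of the difference (assuming Airflow 2.x; the DAG id here is just illustrative):

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(dag_id="template_timing_example", start_date=days_ago(1), schedule_interval=None) as dag:
    # Top-level access (commented out): would run on every DAG-file parse,
    # hitting the metadata DB from the scheduler each time.
    # config = Variable.get("jobconfig", deserialize_json=True)

    # Templated access: rendered only when the task instance runs,
    # so the metadata-DB call happens on the worker instead.
    uses_template = BashOperator(
        task_id="uses_template",
        bash_command='echo {{ var.json.jobconfig.Legion.company }}',
    )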

Second point:

I think the key aspect here is that the value you want to pass downstream is always the same and won't change after T1 has executed. There may be a few approaches here, but if you want to minimize the number of calls to the DB and avoid XComs entirely, you should use the TriggerDagRunOperator.

To do so, you have to split your DAG into two parts: a controller DAG with the task that fetches the data from MySQL, which then triggers a second DAG where you execute all of the BashOperator tasks using the values obtained from the controller DAG. You can pass the data in using the conf parameter.

Here is an example based on the official Airflow example DAGs:

Controller DAG:

from airflow import DAG
from airflow.models import Variable
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def _data_from_mysql():
    # fetch data from the DB or anywhere else
    # set a Variable
    data = {'legion': {'company': 'some_company', 'laptop': 'great_laptop'}}
    Variable.set('jobconfig', data, serialize_json=True)


dag = DAG(
    dag_id="example_trigger_controller_dag",
    default_args={"owner": "airflow"},
    start_date=days_ago(2),
    schedule_interval="@once",
    tags=['example'],
)

get_data_from_MySql = PythonOperator(
    task_id='get_data_from_MySql',
    python_callable=_data_from_mysql,
)

trigger = TriggerDagRunOperator(
    task_id="test_trigger_dagrun",
    # Ensure this equals the dag_id of the DAG to trigger
    trigger_dag_id="example_trigger_target_dag",
    conf={"message": "Company is {{var.json.jobconfig.legion.company}}"},
    execution_date='{{ds}}',
    dag=dag,
)
get_data_from_MySql >> trigger

When the trigger task gets executed, it will include the key message as part of the configuration for the DAG run of the second DAG.

Target DAG:

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

dag = DAG(
    dag_id="example_trigger_target_dag",
    default_args={"owner": "airflow"},
    start_date=days_ago(2),
    schedule_interval=None,
    tags=['example'],
)


def run_this_func(**context):
    """
    Print the payload "message" passed to the DagRun conf attribute.

    :param context: The execution context
    :type context: dict
    """
    print("Remotely received value of {} for key=message".format(
        context["dag_run"].conf["message"]))


run_this = PythonOperator(
    task_id="run_this", python_callable=run_this_func, dag=dag)

bash_task_1 = BashOperator(
    task_id="bash_task_1",
    bash_command='echo "Here is the message: $message"',
    env={'message': '{{ dag_run.conf["message"] if dag_run else "" }}'},
    dag=dag
)

The logs of bash_task_1 in this example will include:

[2021-05-05 15:40:35,410] {bash.py:158} INFO - Running command: echo "Here is the message: $message"
[2021-05-05 15:40:35,418] {bash.py:169} INFO - Output:
[2021-05-05 15:40:35,419] {bash.py:173} INFO - Here is the message: Company is some_company
[2021-05-05 15:40:35,420] {bash.py:177} INFO - Command exited with return code 0

Recap:

  • One task to fetch data from the DB and set it as a Variable
  • Trigger a second DAG, passing the data from the Variable in conf
  • In your target DAG, consume the data from dag_run.conf

This way you are only reading from the metadata DB once, when the second DAG is triggered.

Also, to avoid repeating too much code in the BashOperator task definitions, you could do something like this:

templated_bash_cmd = """
python3 {{params.path_to_script}} {{dag_run.conf["laptopName"]}} {{dag_run.conf["company"]}}
"""

bash_task_1 = BashOperator(
    task_id="bash_task_1",
    bash_command=templated_bash_cmd,
    params={
        'path_to_script': 'path/sequence1.py'
    },
    dag=dag
)
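Building on that template, the four sequence tasks from your DAG could then be generated in a loop instead of being declared one by one (a sketch; the script paths are the placeholders from your question, and the conf keys laptopName and company are assumed to be passed by the controller DAG):

scripts = {
    'sequence_one': 'path/sequence1.py',
    'sequence_two': 'path/sequence2.py',
    'sequence_three': 'path/sequence3.py',
    'sequence_four': 'path/sequence4.py',
}

# One BashOperator per script, all sharing the same templated command
sequence_tasks = {
    task_id: BashOperator(
        task_id=task_id,
        bash_command=templated_bash_cmd,
        params={'path_to_script': script},
        dag=dag,
    )
    for task_id, script in scripts.items()
}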

Let me know if that worked for you!
