
Airflow: Passing a dynamic value to Sub DAG operator

I am new to Airflow.
I have come across a scenario where the parent DAG needs to pass a dynamic number (say n) to a SubDAG.
The SubDAG will then use this number to dynamically create n parallel tasks.

The Airflow documentation doesn't cover a way to achieve this, so I have explored a couple of approaches:

Option - 1 (Using xcom Pull)

I have tried to pass it as an XCom value, but for some reason the SubDAG is not resolving the template to the passed value.

Parent Dag File

def load_dag(**kwargs):
    number_of_runs = json.dumps(kwargs['dag_run'].conf['number_of_runs'])
    dag_data = json.dumps({
        "number_of_runs": number_of_runs
    })
    return dag_data

# ------------------ Tasks ------------------------------
load_config = PythonOperator(
    task_id='load_config',
    provide_context=True,
    python_callable=load_dag,
    dag=dag)


t1 = SubDagOperator(
    task_id=CHILD_DAG_NAME,
    subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args, "'{{ ti.xcom_pull(task_ids='load_config') }}'" ),
    default_args=default_args,
    dag=dag,
)

Sub Dag File

def sub_dag(parent_dag_name, child_dag_name, args, num_of_runs):
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval=None)

    variabe_names = {}

    for i in range(num_of_runs):
        variabe_names['task' + str(i + 1)] =  DummyOperator(
        task_id='dummy_task',
        dag=dag_subdag,
    )

    return dag_subdag

Option - 2

I have also tried to pass number_of_runs as a global variable, which did not work.

Option - 3

We also tried to write this value to a data file, but the sub DAG throws a "File doesn't exist" error. This might be because we are generating the file dynamically.

Can someone help me with this?

I've done it with Option 3. The key is to return a valid DAG with no tasks if the file does not exist. load_config will then generate a file containing your number of tasks, or more information if needed. Your subdag factory would look something like:

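def subdag(...):
    sdag = DAG('%s.%s' % (parent, child), default_args=args, schedule_interval=timedelta(hours=1))
    file_path = "/path/to/generated/file"
    # If load_config has not written the file yet, fall through and
    # return a valid DAG with no tasks.
    if os.path.exists(file_path):
        with open(file_path) as data_file:
            for task in data_file:
                task = task.strip()  # drop the trailing newline, which is not valid in a task_id
                if not task:
                    continue
                DummyOperator(
                    task_id='task_' + task,
                    default_args=args,
                    dag=sdag,
                )
    return sdag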

At DAG generation you will see a subdag with no tasks. At DAG execution, after load_config is done, you can see your dynamically generated subdag.
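The file handling on both sides of this approach can be sketched without Airflow. This is only a minimal illustration: the helper names write_task_file and read_task_ids, and the one-suffix-per-line file layout, are assumptions and not part of the answer above. The first helper would be called from load_config, the second from the subdag factory.

```python
import os
import re
import tempfile


def write_task_file(file_path, number_of_runs):
    """Write one task suffix per line (hypothetical layout); call from load_config."""
    with open(file_path, "w") as data_file:
        for i in range(number_of_runs):
            data_file.write("%d\n" % (i + 1))


def read_task_ids(file_path):
    """Return the task ids listed in the file, or [] if it does not exist yet."""
    if not os.path.exists(file_path):
        return []
    with open(file_path) as data_file:
        suffixes = [line.strip() for line in data_file if line.strip()]
    # Replace anything other than letters, digits, '_', '.' and '-' so the
    # result is safe to use as a task_id.
    return ["task_" + re.sub(r"[^A-Za-z0-9_.-]", "_", s) for s in suffixes]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "generated_file")
    print(read_task_ids(path))  # the file has not been written yet
    write_task_file(path, 3)
    print(read_task_ids(path))
```

Inside the factory, one DummyOperator would then be created per entry returned by read_task_ids, so a missing file simply yields a subdag with no tasks.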

Option 1 should work if you just change the call to xcom_pull to include the dag_id of the parent DAG. By default, the xcom_pull call will look for the task_id 'load_config' in its own DAG, where it doesn't exist.

So change the xcom_pull macro to:

subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args, "'{{ ti.xcom_pull(task_ids='load_config', dag_id='" + PARENT_DAG_NAME + "') }}'" ),
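Building that string by concatenation makes it easy to drop a quote or parenthesis, so a small helper may be clearer. The sketch below is only an illustration: xcom_pull_template is a hypothetical name, the value "parent_dag" is a placeholder, and the extra single quotes wrapped around the template in the line above are left out.

```python
def xcom_pull_template(task_id, dag_id):
    """Build a Jinja expression that pulls an XCom pushed by a task in another DAG."""
    return "{{ ti.xcom_pull(task_ids='%s', dag_id='%s') }}" % (task_id, dag_id)


PARENT_DAG_NAME = "parent_dag"  # placeholder value for illustration
template = xcom_pull_template("load_config", PARENT_DAG_NAME)
print(template)
```

Keep in mind that, as far as I know, Airflow renders such expressions only inside an operator's templated fields when the task runs. A factory function called while the DAG file is parsed receives the raw string, so a call like range(num_of_runs) would still need an actual integer.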

If the filename you are writing to is not dynamic (e.g. you are overwriting the same file again and again for each task instance), Jaime's answer will work:

file_path = "/path/to/generated/file"

But if you need a unique filename, or want each task instance to write different content to the file for tasks executed in parallel, Airflow will not work for this case, since there is no way to pass the execution date or a variable outside of a template. Take a look at this post.

Take a look at my answer here, where I describe a way to dynamically create tasks based on the results of previously executed tasks, using XComs and subdags.


 