
Want to create airflow tasks that are downstream of the current task

I'm mostly brand new to airflow.

I have a two step process:

  1. Get all files that match a criterion
  2. Uncompress the files

The files are half a gig compressed, and 2 - 3 gigs when uncompressed. I can easily have 20+ files to process at a time, which means uncompressing all of them can run longer than just about any reasonable timeout.

I could use XCom to get the results of step 1, but what I'd like to do is something like this:

import os

def processFiles(reqDir, gvcfDir, matchSuffix):
    theFiles = getFiles(reqDir, gvcfDir, matchSuffix)

    for theFile in theFiles:
        # one uncompress task per matching file
        task = PythonOperator(task_id="Uncompress_" + os.path.basename(theFile),
                              python_callable=expandFile,
                              op_kwargs={'theFile': theFile},
                              dag=dag)
        task.set_upstream(runThis)

The problem is that "runThis" is the PythonOperator that called processFiles, so it has to be declared after processFiles.

Is there any way to make this work?

Is this the reason that XCom exists, and should I dump this approach and go with XCom?

Regarding your proposed solution, I don't think you can use XComs to achieve this, as they are only available to task instances at run time and not when you define the DAG (to the best of my knowledge).
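To illustrate the timing problem: XComs are pushed and pulled by running task instances, while operators have to exist when the DAG file is parsed. Here is a minimal, Airflow-free sketch of that run-time-only handoff (the `FakeTaskInstance` class is a stand-in for illustration, not an Airflow API):

```python
class FakeTaskInstance:
    """Stand-in for an Airflow task instance's XCom storage."""
    def __init__(self):
        self._xcom = {}

    def xcom_push(self, key, value):
        self._xcom[key] = value

    def xcom_pull(self, key):
        return self._xcom.get(key)

def list_files(ti):
    # Runs at execution time; the file list only exists from here on.
    ti.xcom_push("files", ["a.gz", "b.gz"])

def expand_all(ti):
    # Only inside a running instance is the list available -- too late
    # to declare new operators, since the DAG was already parsed.
    return ti.xcom_pull("files")

ti = FakeTaskInstance()
list_files(ti)
print(expand_all(ti))  # ['a.gz', 'b.gz']
```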

You can however use a SubDAG to achieve your objective. The SubDagOperator is given a DAG produced by a generator function, which gives you a chance to dynamically create a sub-section of your workflow.

You can test the idea using this simple example, which generates a random number of tasks every time it's invoked:

import airflow
from builtins import range
from random import randint
from airflow.operators.bash_operator import BashOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.models import DAG

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2)
}

dag = DAG(dag_id='dynamic_dag', default_args=args)

def generate_subdag(parent_dag, dag_id, default_args):
    # pseudo-randomly determine a number of tasks to be created
    n_tasks = randint(1, 10)

    subdag = DAG(
        '%s.%s' % (parent_dag.dag_id, dag_id),
        schedule_interval=parent_dag.schedule_interval,
        start_date=parent_dag.start_date,
        default_args=default_args
    )
    for i in range(n_tasks):
        i = str(i)
        task = BashOperator(task_id='echo_%s' % i, bash_command='echo %s' % i, dag=subdag)

    return subdag

subdag_dag_id = 'dynamic_subdag'

SubDagOperator(
    subdag=generate_subdag(dag, subdag_dag_id, args),
    task_id=subdag_dag_id,
    dag=dag
)

If you execute this you'll notice that in different runs the SubDAG is likely to contain a different number of tasks (I tested this with version 1.8.0). You can access the SubDAG view in the WebUI by opening the graph view, clicking on the grey SubDAG node and then on "Zoom into SubDAG".

You can use this concept by listing files and creating one task for each of them, instead of generating a random number of tasks as in the example. The tasks themselves can be arranged in parallel (as I did), sequentially, or in any valid directed acyclic layout.

