Want to create airflow tasks that are downstream of the current task
I'm mostly brand new to airflow.
I have a two-step process:
The files are half a gig compressed, and 2-3 gig when uncompressed. I can easily have 20+ files to process at a time, which means uncompressing all of them can run longer than just about any reasonable timeout.
I could use XCom to get the results of step 1, but what I'd like to do is something like this:
def processFiles(reqDir, gvcfDir, matchSuffix):
    theFiles = getFiles(reqDir, gvcfDir, matchSuffix)
    for theFile in theFiles:
        task = PythonOperator(task_id="Uncompress_" + os.path.basename(theFile),
                              python_callable=expandFile,
                              op_kwargs={'theFile': theFile},
                              dag=dag)
        task.set_upstream(runThis)
The problem is that "runThis" is the PythonOperator that calls processFiles, so it would have to be declared after processFiles.
Is there any way to make this work?
Is this the reason that XCom exists, and I should dump this approach and go with XCom?
Regarding your proposed solution, I don't think you can use XComs to achieve this, as they are only available to instances and not when you define the DAG (to the best of my knowledge).
You can however use a SubDAG to achieve your objective. The SubDagOperator takes a function that is invoked when the operator executes and that generates a DAG, giving you a chance to dynamically create a sub-section of your workflow.
You can test the idea using this simple example, which generates a random number of tasks every time it's invoked:
import airflow
from builtins import range
from random import randint
from airflow.operators.bash_operator import BashOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.models import DAG

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2)
}

dag = DAG(dag_id='dynamic_dag', default_args=args)

def generate_subdag(parent_dag, dag_id, default_args):
    # pseudo-randomly determine a number of tasks to be created
    n_tasks = randint(1, 10)
    subdag = DAG(
        '%s.%s' % (parent_dag.dag_id, dag_id),
        schedule_interval=parent_dag.schedule_interval,
        start_date=parent_dag.start_date,
        default_args=default_args
    )
    for i in range(n_tasks):
        i = str(i)
        task = BashOperator(task_id='echo_%s' % i, bash_command='echo %s' % i, dag=subdag)
    return subdag

subdag_dag_id = 'dynamic_subdag'

SubDagOperator(
    subdag=generate_subdag(dag, subdag_dag_id, args),
    task_id=subdag_dag_id,
    dag=dag
)
If you execute this you'll notice that in different runs SubDAGs are likely to contain a different number of tasks (I tested this with version 1.8.0). You can access the SubDAG view on the WebUI by accessing the graph view, clicking on the grey SubDAG node and then on "Zoom into SubDAG".
You can use this concept by listing the files and creating one task for each of them, instead of generating a random number of tasks as in the example. The tasks themselves can be arranged in parallel (as I did), sequentially, or in any valid directed acyclic layout.