
Reversed upstream/downstream relationships when generating multiple tasks in Airflow

The original code related to this question can be found here.

I'm confused by how both the bitshift operators and the set_upstream / set_downstream methods work within a task loop that I've defined in my DAG. When the main execution loop of the DAG is configured as follows:

for uid in dash_workers.get_id_creds():
    clear_tables.set_downstream(id_worker(uid))

or

for uid in dash_workers.get_id_creds():
    clear_tables >> id_worker(uid)

The graph looks like this (the alphanumeric sequences are the user IDs, which also define the task IDs):

[screenshot]

When I configure the main execution loop of the DAG like this:

for uid in dash_workers.get_id_creds():
    clear_tables.set_upstream(id_worker(uid))

or

for uid in dash_workers.get_id_creds():
    id_worker(uid) >> clear_tables

the graph looks like this:

[screenshot]

The second graph is what I want, and what I would have expected the first two snippets of code to produce based on my reading of the docs. If I want clear_tables to execute first, before triggering my batch of data parsing tasks for different user IDs, should I indicate this as clear_tables >> id_worker(uid)?

EDIT: Here's the complete code, which has been updated since I posted the last few questions, for reference:

from datetime import datetime
import os
import sys

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

import ds_dependencies

SCRIPT_PATH = os.getenv('DASH_PREPROC_PATH')
if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import dash_workers
else:
    print('Define DASH_PREPROC_PATH value in environmental variables')
    sys.exit(1)

ENV = os.environ

default_args = {
  'start_date': datetime.now(),
}

DAG = DAG(
  dag_id='dash_preproc',
  default_args=default_args
)

clear_tables = PythonOperator(
  task_id='clear_tables',
  python_callable=dash_workers.clear_db,
  dag=DAG)

def id_worker(uid):
    return PythonOperator(
        task_id=uid,
        python_callable=dash_workers.main_preprocess,
        op_args=[uid],
        dag=DAG)

for uid in dash_workers.get_id_creds():
    preproc_task = id_worker(uid)
    clear_tables << preproc_task

After implementing @LadislavIndra's suggestion, I still have to use the same reversed form of the bitshift operator to get the correct dependency graph.

UPDATE: @AshBerlin-Taylor's answer explains what's going on here. I assumed that Graph View and Tree View were doing the same thing, but they're not. Here's what id_worker(uid) >> clear_tables looks like in graph view:

[screenshot: graph view]

I certainly don't want the final step in my data pre-prep routine to be to delete all data tables!

The tree view in Airflow is "backwards" to how you (and I!) first thought about it. In your first screenshot it is showing that "clear_tables" must be run before the "AAAG5608078M2" run task. And the DAG status depends upon each of the id worker tasks. So instead of a task order, it's a tree of the status chain. If that makes any sense at all.

(This might seem strange at first, but it's because a DAG can branch out and branch back in.)

You might have better luck looking at the Graph view for your DAG. This one has arrows and shows the execution order in a more intuitive way. (Though I do now find the tree view useful; it's just less clear to start with.)
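The direction of the operators can be checked with a toy model. The Task class below is a hypothetical stand-in, not Airflow's implementation; it only mirrors the convention that a >> b is equivalent to a.set_downstream(b) and a << b is equivalent to a.set_upstream(b). The task names worker_a and worker_b are made up for illustration.

```python
class Task:
    """Toy stand-in for an Airflow operator; it only tracks dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()    # task_ids that must finish before this one
        self.downstream = set()  # task_ids that run after this one

    def set_downstream(self, other):
        # `other` runs after `self`
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)

    def set_upstream(self, other):
        # `other` runs before `self`
        other.set_downstream(self)

    def __rshift__(self, other):  # self >> other
        self.set_downstream(other)
        return other

    def __lshift__(self, other):  # self << other
        self.set_upstream(other)
        return other


clear_tables = Task("clear_tables")
worker_a = Task("worker_a")
worker_b = Task("worker_b")

clear_tables >> worker_a   # same as clear_tables.set_downstream(worker_a)
worker_b << clear_tables   # same as worker_b.set_upstream(clear_tables)

print(sorted(clear_tables.downstream))  # ['worker_a', 'worker_b']
print(worker_a.upstream)                # {'clear_tables'}
```

Both forms leave clear_tables with no upstream tasks, so it runs first, which is what the Graph view arrows show.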

Looking through your other code, it seems get_id_creds is your task and you're trying to loop through it, which is creating some weird interaction.

A pattern that will work is:

clear_tables = MyOperator()

for uid in uid_list:
  my_task = MyOperator(task_id=uid)
  clear_tables >> my_task
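To see the execution order this loop implies without running Airflow, the dependencies can be modelled with the standard library. This is a minimal sketch, assuming Python 3.9+ for graphlib; the upstream dict and all user IDs except AAAG5608078M2 are hypothetical stand-ins, not part of Airflow's API.

```python
from graphlib import TopologicalSorter

# Hypothetical user IDs standing in for uid_list.
uid_list = ["AAAG5608078M2", "BBBH1234567N3", "CCCJ7654321P4"]

# Each `clear_tables >> my_task` makes clear_tables an upstream task of my_task.
# TopologicalSorter expects a mapping of node -> set of predecessors.
upstream = {uid: {"clear_tables"} for uid in uid_list}

# static_order() yields every node after all of its predecessors.
execution_order = list(TopologicalSorter(upstream).static_order())
print(execution_order)
```

In this model clear_tables always comes first in the order, and the per-user tasks follow it with no ordering constraints among themselves.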
