
How to dynamically iterate over the output of an upstream task to create parallel tasks in airflow?

Consider the following example of a DAG where the first task, get_id_creds, extracts a list of credentials from a database. This operation tells me which users in my database I am able to run further data preprocessing on, and it writes those ids to the file /tmp/ids.txt. I then scan those ids into my DAG and use them to generate a list of upload_transactions tasks that can be run in parallel.

My question is: Is there a more idiomatically correct, dynamic way to do this using Airflow? What I have here feels clumsy and brittle. How can I directly pass a list of valid IDs from one process to the one that defines the subsequent downstream processes?

from datetime import datetime, timedelta
import os
import sys

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

import ds_dependencies

SCRIPT_PATH = os.getenv('DASH_PREPROC_PATH')
if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import dash_workers
else:
    print('Define DASH_PREPROC_PATH value in environmental variables')
    sys.exit(1)

default_args = {
  'start_date': datetime.now(),
  'schedule_interval': None
}

DAG = DAG(
  dag_id='dash_preproc',
  default_args=default_args
)

get_id_creds = PythonOperator(
    task_id='get_id_creds',
    python_callable=dash_workers.get_id_creds, 
    provide_context=True,
    dag=DAG)

with open('/tmp/ids.txt', 'r') as infile:
    uids = infile.read().splitlines()

for uid in uids:
    upload_transactions = PythonOperator(
        task_id=uid,
        python_callable=dash_workers.upload_transactions,
        op_args=[uid],
        dag=DAG)
    upload_transactions.set_upstream(get_id_creds)  # get_id_creds must run first

Per @Juan Riza's suggestion I checked out this link: Proper way to create dynamic workflows in Airflow. This was pretty much the answer, although I was able to simplify the solution enough that I thought I would offer my own modified version of the implementation here:

from datetime import datetime
import os
import sys

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

import ds_dependencies

SCRIPT_PATH = os.getenv('DASH_PREPROC_PATH')
if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import dash_workers
else:
    print('Define DASH_PREPROC_PATH value in environmental variables')
    sys.exit(1)

ENV = os.environ

default_args = {
  # 'start_date': datetime.now(),
  'start_date': datetime(2017, 7, 18)
}

DAG = DAG(
  dag_id='dash_preproc',
  default_args=default_args
)

clear_tables = PythonOperator(
  task_id='clear_tables',
  python_callable=dash_workers.clear_db,
  dag=DAG)

def id_worker(uid):
    return PythonOperator(
        task_id=uid,
        python_callable=dash_workers.main_preprocess,
        op_args=[uid],
        dag=DAG)

for uid in dash_workers.get_id_creds():
    clear_tables >> id_worker(uid)

clear_tables cleans out the database that will be re-built as a result of the process. id_worker is a function that dynamically generates new preprocessing tasks, based on the array of ID values returned from get_id_creds. The task ID is just the corresponding user ID, though it could easily have been an index, i, as in the example mentioned above.
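As a variant, here is a minimal sketch of that index-based naming, with id_worker widened to take the task id separately from the payload (illustrative only, not part of the original code):

def id_worker(task_id, uid):
    return PythonOperator(
        task_id=task_id,  # e.g. 'preprocess_0', 'preprocess_1', ...
        python_callable=dash_workers.main_preprocess,
        op_args=[uid],
        dag=DAG)

for i, uid in enumerate(dash_workers.get_id_creds()):
    clear_tables >> id_worker('preprocess_{}'.format(i), uid)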

NOTE: That bitshift operator (<<) looks backwards to me, as the clear_tables task should come first, but it is what seems to be working in this case.
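For reference, a quick sketch of the equivalent ways to declare the same ordering in Airflow (first_task and second_task are hypothetical operator instances); all four lines say that first_task runs before second_task:

# Equivalent dependency declarations:
first_task >> second_task
second_task << first_task
first_task.set_downstream(second_task)
second_task.set_upstream(first_task)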

Note that Apache Airflow is a workflow management tool, i.e. it determines the dependencies between tasks that the user defines; compare this with (for example) Apache NiFi, which is a dataflow management tool, i.e. there the dependencies are data that are transferred through the tasks.

That said, I think your approach is quite right (my comment is based on the code posted), but Airflow offers a concept called XCom. It allows tasks to "cross-communicate" by passing some data between them. How big can the passed data be? That is up to you to test! But generally it should not be very big. It is stored as key/value pairs in the Airflow meta-database, i.e. you can't pass files, for example, but a list of ids could work.

Like I said, you should test it yourself; I would be very happy to hear about your experience. Here is an example DAG which demonstrates the use of XCom, and here is the necessary documentation. Cheers!
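As an illustration of that idea, here is a minimal sketch of passing the id list through XCom instead of a temp file, using the Airflow 1.x API seen elsewhere in this post (the function bodies, dag_id, and task names are placeholders, not the poster's actual code):

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(dag_id='xcom_ids_example',
          start_date=datetime(2017, 7, 18),
          schedule_interval=None)

def get_id_creds(**context):
    # Stand-in for the real database query.
    return ['user_1', 'user_2', 'user_3']  # the return value is pushed to XCom automatically

def upload_transactions(**context):
    # Pull the list that get_id_creds returned.
    uids = context['ti'].xcom_pull(task_ids='get_id_creds')
    for uid in uids:
        print('processing', uid)

get_ids = PythonOperator(
    task_id='get_id_creds',
    python_callable=get_id_creds,
    provide_context=True,
    dag=dag)

process = PythonOperator(
    task_id='upload_transactions',
    python_callable=upload_transactions,
    provide_context=True,
    dag=dag)

get_ids >> process

Note that XCom moves data between tasks that already exist in the DAG; it does not by itself create a dynamic number of tasks, which is what the answer below addresses for newer Airflow versions.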

Copying my answer from this question. Only for Airflow v2.3 and above:

This is achieved using Dynamic Task Mapping, which is only available in Airflow versions 2.3 and higher.

More documentation and examples can be found in the Airflow documentation on Dynamic Task Mapping.

Example:

from datetime import datetime

from airflow import DAG
from airflow.decorators import task


@task
def make_list():
    # This can also be from an API call, checking a database -- almost anything you like, as long as the
    # resulting list/dictionary can be stored in the current XCom backend.
    return [1, 2, {"a": "b"}, "str"]


@task
def consumer(arg):
    # One mapped task instance is created per element of the returned list.
    print(repr(arg))


with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
    consumer.expand(arg=make_list())

Example 2:

from airflow import XComArg

# MyOperator / MyOperator2 are placeholder operator classes.
task = MyOperator(task_id="source")

# Expand the downstream operator over the XCom output of the "source" task.
downstream = MyOperator2.partial(task_id="consumer").expand(input=XComArg(task))

The graph view and tree view are also updated:

  • Graph view
  • Tree view

Relevant issues here:
