
Calling TaskGroup with Dynamic sub task id from BranchPythonOperator

I want to call a TaskGroup with a Dynamic sub-task id from BranchPythonOperator.

This is the DAG flow that I have:

(graph view of the branch_dag DAG)

My case is I want to check whether a table exists in BigQuery or not.

  • If it exists: do nothing and end the DAG

  • If it does not exist: ingest the data from Postgres to Google Cloud Storage

I know that the way to call a TaskGroup from BranchPythonOperator is to reference the task id in the following format:
group_task_id.task_id
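For illustration, the branch callable then just returns that fully qualified id (a minimal sketch with placeholder ids, not my actual DAG below):

def choose_branch():
    # Return the fully qualified "group_id.task_id" of the task to follow.
    return 'group_task_id.task_id'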

The problem is, my task group's sub task ids are dynamic, depending on how many times I loop inside the TaskGroup. So the sub tasks will be:

parent_task_id.sub_task_1
parent_task_id.sub_task_2
parent_task_id.sub_task_3
...
parent_task_id.sub_task_x

This is the code for the DAG that I have:

import airflow
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
from airflow.utils.task_group import TaskGroup
from google.cloud.exceptions import NotFound
from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.dummy import DummyOperator
from google.cloud import bigquery

default_args = {
    'owner': 'Airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

with DAG(dag_id='branch_dag', default_args=default_args, schedule_interval=None) as dag:

    def create_task_group(worker=1):
        # Build a TaskGroup with `worker` parallel PostgresToGCSOperator sub tasks.
        var = dict()
        with TaskGroup(group_id='parent_task_id') as tg1:
            for i in range(worker):
                var[f'sub_task_{i}'] = PostgresToGCSOperator(
                    task_id = f'sub_task_{i}',
                    postgres_conn_id = 'some_postgres_conn_id',
                    sql = 'test.sql',
                    bucket = 'test_bucket',
                    filename = 'test_file.json',
                    export_format = 'json',
                    gzip = True,
                    params = {
                        'worker': worker
                    }
                )
        return tg1
    
    def is_exists_table():
        # Branch callable: must return the task_id (or group_id.task_id) of the path to follow.
        client = bigquery.Client()
        try:
            table_name = client.get_table('dataset_id.some_table')
            if table_name:
                return 'end'  # task_id of task_end
        except NotFound:
            # Table is missing: branch into the TaskGroup.
            return 'parent_task_id'

    task_start = DummyOperator(
        task_id = 'start'
        )

    task_branch_table = BranchPythonOperator(
        task_id ='check_table_exists_in_bigquery',
        python_callable = is_exists_table
        )

    task_pg_to_gcs_init = create_task_group(worker=3)

    task_end = DummyOperator(
        task_id = 'end',
        trigger_rule = 'all_done'
    )    

    task_start >> task_branch_table >> task_end
    task_start >> task_branch_table >> task_pg_to_gcs_init >> task_end

When I run the dag, it returns

airflow.exceptions.TaskNotFound: Task parent_task_id not found

But this is expected; what I don't know is how to iterate over parent_task_id.sub_task_x in the is_exists_table function. Or is there any workaround?
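One direction I can think of is to rebuild that list inside is_exists_table with the same loop bound, since, as far as I know, the branch callable in Airflow 2 may also return a list of task ids. A sketch (not tested), with the worker count hard-coded to 3 to match create_task_group(worker=3):

def is_exists_table(worker=3):
    client = bigquery.Client()
    try:
        if client.get_table('dataset_id.some_table'):
            return 'end'  # task_id of task_end
    except NotFound:
        # Follow every dynamically created sub task in the TaskGroup.
        return [f'parent_task_id.sub_task_{i}' for i in range(worker)]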

This is the test.sql file if it's needed:


SELECT 
id,
name,
country
FROM some_table
WHERE 1=1
AND ABS(MOD(hashtext(id::TEXT), 3)) = {{params.worker}};

-- returns 1M+ rows

I have already seen this Question as a reference, but I think my case is more specific.

When designing your data pipelines, you may encounter use cases that require more complex task flows than "Task A > Task B > Task C." For example, you may have a use case where you need to decide between multiple tasks to execute based on the results of an upstream task. Or you may have a case where part of your pipeline should only run under certain external conditions. Fortunately, Airflow has multiple options for building conditional logic and/or branching into your DAGs.
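As a minimal, generic illustration of that branching pattern (a sketch with made-up task ids, not tied to the DAG in the question), the branch callable simply returns the task_id of the path to take:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils.dates import days_ago

with DAG(dag_id='branch_example', start_date=days_ago(1), schedule_interval=None) as dag:

    def pick_path():
        # Hypothetical condition; in this question it would be "does the BigQuery table exist?"
        condition_is_met = True
        return 'skip_ingest' if condition_is_met else 'run_ingest'

    branch = BranchPythonOperator(task_id='pick_path', python_callable=pick_path)
    branch >> [DummyOperator(task_id='skip_ingest'), DummyOperator(task_id='run_ingest')]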

I found a dirty way around it.

What I did was create one additional task using DummyOperator, called task_pass.

    task_pass = DummyOperator(
        task_id = 'pass_to_task_group'
    )

So the DAG flow now looks like this:

task_start >> task_branch_table >> task_end
task_start >> task_branch_table >> task_pass >> task_pg_to_gcs_init >> task_end
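With that pass-through task in place, the branch callable no longer needs to know any sub task id; it only returns the id of the DummyOperator in front of the TaskGroup (a sketch based on the code above):

def is_exists_table():
    client = bigquery.Client()
    try:
        if client.get_table('dataset_id.some_table'):
            return 'end'  # task_id of task_end
    except NotFound:
        # Branch to the pass-through task, which then fans out into the TaskGroup.
        return 'pass_to_task_group'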

Also, there is one mistake that I made in the code above: notice that the param I set was worker. This is wrong because worker is a constant, while the thing I need to iterate over is the i variable. So I changed it from:

params: worker

to:

params: i
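Inside the loop in create_task_group, the corrected operator then looks like this (sketch):

for i in range(worker):
    var[f'sub_task_{i}'] = PostgresToGCSOperator(
        task_id = f'sub_task_{i}',
        postgres_conn_id = 'some_postgres_conn_id',
        sql = 'test.sql',
        bucket = 'test_bucket',
        filename = 'test_file.json',
        export_format = 'json',
        gzip = True,
        params = {
            'worker': i  # pass the loop index so each sub task reads its own partition
        }
    )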
