
Architecting an Airflow DAG that needs contextual throttling

  • I have a group of job units (workers) that I want to run as a DAG.
  • Group1 has 10 workers, and each worker extracts multiple tables from a DB. Note that each worker maps to a single DB instance, and each worker needs to successfully process 100 tables in total before it can mark itself as complete.
  • Group1 has a limitation that no more than 5 tables across all 10 of those workers should be consumed at a time. For example:
    • Worker1 is extracting 2 tables
    • Worker2 is extracting 2 tables
    • Worker3 is extracting 1 table
    • Worker4...Worker10 need to wait until Worker1...Worker3 relinquish the threads
    • Worker4...Worker10 can pick up tables as soon as the threads from step1 free up
    • As each worker completes all of its 100 tables, it proceeds to step2 without waiting. Step2 has no concurrency limits.

I should be able to create a single Group1 node that caters to the throttling, and also have:

  • 10 independent worker nodes, so that I can restart any one of them if it fails

I have tried explaining this in the following diagram:

  • If any of the workers fails, I can restart it without affecting the other workers. It still uses the same thread pool from Group1, so the concurrency limits are enforced.
  • Group1 completes once all elements of step1 and step2 are complete.
  • Step2 doesn't have any concurrency measures.

How do I implement such a hierarchy in Airflow for a Spring Boot Java application? Is it possible to design this kind of DAG using Airflow constructs and dynamically tell the Java application how many tables it can extract at a time? For instance, if all workers except Worker1 are finished, Worker1 can now use all 5 available threads, while everything else proceeds to step2.

These constraints cannot be modeled as a directed acyclic graph, and thus cannot be implemented in Airflow exactly as described. However, they can be modeled as queues, and thus could be implemented with a job queue framework. Here are your two options:

Implement suboptimally as an Airflow DAG:

from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
# Executors that inherit from BaseExecutor take a parallelism parameter
from wherever import SomeExecutor, SomeOperator

main_dag = DAG("main")

# Table load jobs are done with parallelism 5
load_tables_subdag = DAG("main.load_tables")
load_tables = SubDagOperator(
    task_id="load_tables",
    subdag=load_tables_subdag,
    dag=main_dag,
    executor=SomeExecutor(parallelism=5),
)

# Each table load must be its own job, or must be split into sets of tables of
# predetermined size, such that num_tables_per_job * parallelism = 5
for table in tables:
    load_table = SomeOperator(task_id=f"load_table_{table}", dag=load_tables_subdag)

# Jobs done afterwards are run with higher parallelism
afterwards_subdag = DAG("main.afterwards")
afterwards = SubDagOperator(
    task_id="afterwards",
    subdag=afterwards_subdag,
    dag=main_dag,
    executor=SomeExecutor(parallelism=high_parallelism),
)

for job in jobs:
    afterward_job = SomeOperator(task_id=f"job_{job}", dag=afterwards_subdag)

# After _all_ table load jobs are complete, start the jobs that should be done afterwards
load_tables >> afterwards
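To make the comment about splitting tables into sets concrete, here is a small sketch of how the loop above could look when each task handles a batch of tables rather than a single one. The tables_per_worker mapping, the chunks helper, and the batch size of 10 are assumptions for illustration only; the point is that with the executor's parallelism capped at 5 and each task extracting its batch one table at a time, no more than 5 tables are in flight at any moment.

def chunks(seq, size):
    # Split a list into consecutive batches of at most `size` items
    return [seq[i:i + size] for i in range(0, len(seq), size)]

# Hypothetical: tables_per_worker maps each of the 10 workers to its ~100 tables
for worker_id, worker_tables in tables_per_worker.items():
    for batch_id, batch in enumerate(chunks(worker_tables, 10)):
        SomeOperator(
            task_id=f"worker_{worker_id}_batch_{batch_id}",
            dag=load_tables_subdag,
            # the operator implementation would extract the tables in `batch`
            # sequentially, one at a time
        )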

The suboptimal aspect here is that, for the first half of the DAG, the cluster will be underutilized by high_parallelism - 5 slots (for example, with high_parallelism = 20, 15 slots sit idle while the tables load).

Implement optimally with a job queue:

# This is pseudocode, but could be easily adapted to a framework like Celery

# You need two queues
# The table load queue should be initialized with the job items
table_load_queue = Queue(initialize_with_tables)
# The queue for jobs to do afterwards starts empty
afterwards_queue = Queue()

def worker():

    # Work while there's at least one item in either queue
    while not table_load_queue.empty() or not afterwards_queue.empty():
        # Count how many active workers are currently doing table loads
        working_on_table_load = [w.is_working_table_load for w in scheduler.active()]

        # Work table loads if we haven't reached capacity, otherwise work the jobs afterwards
        if sum(working_on_table_load) < 5:
            is_working_table_load = True
            task = table_load_queue.dequeue()
        else:
            is_working_table_load = False
            task = afterwards_queue.dequeue()

        if task:
            after = work(task)
            if is_working_table_load:
                # After working a table load, create the job to work afterwards
                afterwards_queue.enqueue(after)

# Use all the parallelism available
scheduler.start(worker, num_workers=high_parallelism)

Using this approach, the cluster won't be underutilized.
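To make that control flow concrete, here is a minimal runnable sketch of the same two-queue pattern using only the Python standard library (threading and queue) instead of a distributed framework like Celery. The names load_table, do_afterwards_job, TABLES, and HIGH_PARALLELISM are placeholders I am assuming for illustration; a shared semaphore plays the role of the 5-slot table-load limit.

import queue
import threading
import time

TABLES = [f"table_{i}" for i in range(30)]  # placeholder work items
TABLE_LOAD_LIMIT = 5                        # at most 5 concurrent table extracts
HIGH_PARALLELISM = 10                       # total worker threads

table_load_queue = queue.Queue()
for name in TABLES:
    table_load_queue.put(name)
afterwards_queue = queue.Queue()

table_load_slots = threading.Semaphore(TABLE_LOAD_LIMIT)
in_flight_loads = 0
in_flight_lock = threading.Lock()

def load_table(table):
    # Placeholder for the actual table extract; returns the follow-up (step2) job
    time.sleep(0.01)
    return f"afterwards_job_for_{table}"

def do_afterwards_job(job):
    # Placeholder for the unthrottled step2 work
    time.sleep(0.01)

def worker():
    global in_flight_loads
    while True:
        did_table_load = False
        # Prefer a table load if one of the 5 slots is free and a table is queued
        if table_load_slots.acquire(blocking=False):
            try:
                table = table_load_queue.get_nowait()
            except queue.Empty:
                table_load_slots.release()
            else:
                with in_flight_lock:
                    in_flight_loads += 1
                try:
                    # The step2 job is enqueued as soon as the load finishes
                    afterwards_queue.put(load_table(table))
                finally:
                    with in_flight_lock:
                        in_flight_loads -= 1
                    table_load_slots.release()
                did_table_load = True
        if not did_table_load:
            # Otherwise work a step2 job, if any are ready
            try:
                job = afterwards_queue.get_nowait()
            except queue.Empty:
                with in_flight_lock:
                    still_loading = in_flight_loads > 0
                # Stop only when nothing is queued anywhere and no load is in flight
                if table_load_queue.empty() and not still_loading:
                    return
                time.sleep(0.01)  # back off briefly instead of busy-spinning
            else:
                do_afterwards_job(job)

threads = [threading.Thread(target=worker) for _ in range(HIGH_PARALLELISM)]
for th in threads:
    th.start()
for th in threads:
    th.join()

In a real deployment the queues and workers would live in the job queue framework, but the priority rule stays the same: take a table load when fewer than 5 are in flight, otherwise take step2 work.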
