Architecting an Airflow DAG that needs contextual throttling
I should be able to create a single node Group1 that caters to the throttling and also have
I have tried explaining this in the following diagram:
How do I implement such a hierarchy in Airflow for a Spring Boot Java application? Is it possible to design this kind of DAG using Airflow constructs and dynamically tell the Java application how many tables it can extract at a time? For instance, if all workers except Worker1 are finished, Worker1 can now use all 5 available threads while everything else proceeds to step 2.
These constraints cannot be modeled as a directed acyclic graph, and thus cannot be implemented in Airflow exactly as described. However, they can be modeled as queues, and so could be implemented with a job queue framework. Here are your two options:
from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator

# Executors that inherit from BaseExecutor take a parallelism parameter
from wherever import SomeExecutor, SomeOperator

# Table load jobs are done with parallelism 5
load_tables = SubDagOperator(subdag=DAG("load_tables"), executor=SomeExecutor(parallelism=5))

# Each table load must be its own job, or must be split into sets of tables of
# predetermined size, such that num_tables_per_job * parallelism = 5
for table in tables:
    load_table = SomeOperator(task_id=f"load_table_{table}", dag=load_tables)

# Jobs done afterwards are done with higher parallelism
afterwards = SubDagOperator(
    subdag=DAG("afterwards"), executor=SomeExecutor(parallelism=high_parallelism)
)
for job in jobs:
    afterward_job = SomeOperator(task_id=f"job_{job}", dag=afterwards)

# After _all_ table load jobs are complete, start the jobs that should be done afterwards
load_tables >> afterwards
The suboptimal aspect here is that, for the first half of the DAG, the cluster will be underutilized by high_parallelism - 5.
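To make that underutilization concrete, here is a tiny arithmetic sketch; the cluster size of 20 is a hypothetical number, not something from the question:

```python
# Hypothetical numbers: suppose the cluster offers 20 worker slots in total.
high_parallelism = 20
table_load_parallelism = 5  # the throttle on concurrent table loads

# While the load_tables sub-DAG runs, everything beyond 5 slots sits idle.
idle_slots = high_parallelism - table_load_parallelism
print(idle_slots)  # 15 slots idle for the entire first phase
```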
# This is pseudocode, but could be easily adapted to a framework like Celery

# You need two queues.
# The table load queue should be initialized with the job items
table_load_queue = Queue(initialize_with_tables)
# The queue for jobs to do afterwards starts empty
afterwards_queue = Queue()

def worker():
    # Work while there's at least one item in either queue
    while not table_load_queue.empty() or not afterwards_queue.empty():
        working_on_table_load = [w.is_working_table_load for w in scheduler.active()]
        # Work table loads if we haven't reached capacity, otherwise work the jobs afterwards
        if sum(working_on_table_load) < 5:
            is_working_table_load = True
            task = table_load_queue.dequeue()
        else:
            is_working_table_load = False
            task = afterwards_queue.dequeue()
        if task:
            after = work(task)
            if is_working_table_load:
                # After working a table load, create the job to work afterwards
                afterwards_queue.enqueue(after)

# Use all the parallelism available
scheduler.start(worker, num_workers=high_parallelism)
Using this approach, the cluster won't be underutilized.
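For completeness, the pseudocode above can be turned into a runnable sketch with nothing but the standard library; queue.Queue plus a lock and a counter stand in for Celery's scheduler. The table names, thread counts, and the "after_" job naming are hypothetical placeholders:

```python
import queue
import threading

TABLE_LOAD_CAP = 5    # at most 5 concurrent table loads
HIGH_PARALLELISM = 8  # total worker threads available

table_load_queue = queue.Queue()
afterwards_queue = queue.Queue()
for t in ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]:
    table_load_queue.put(t)

lock = threading.Lock()
active_table_loads = 0  # workers currently loading tables
done = []               # record of completed work, for inspection

def worker():
    global active_table_loads
    while True:
        with lock:
            # Prefer table loads while under the cap; otherwise use the
            # spare parallelism to drain the afterwards queue.
            if active_table_loads < TABLE_LOAD_CAP and not table_load_queue.empty():
                task, is_table_load = table_load_queue.get(), True
                active_table_loads += 1
            elif not afterwards_queue.empty():
                task, is_table_load = afterwards_queue.get(), False
            elif table_load_queue.empty() and active_table_loads == 0:
                return  # nothing left anywhere: exit
            else:
                continue  # busy-wait for in-flight table loads (fine for a sketch)
        done.append(task)  # stands in for actually doing the work
        if is_table_load:
            # Finishing a table load creates follow-up work and frees a slot
            afterwards_queue.put(f"after_{task}")
            with lock:
                active_table_loads -= 1

threads = [threading.Thread(target=worker) for _ in range(HIGH_PARALLELISM)]
for th in threads:
    th.start()
for th in threads:
    th.join()

print(len(done))  # 14: 7 table loads plus their 7 follow-up jobs
```

A real Celery deployment would replace the busy-wait with broker-driven task delivery, but the throttling logic (cap table loads at 5, spend any spare capacity on follow-up jobs) is the same.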