
Workflow system for both ETL and Queries by Users

I am looking for a workflow system that supports the following needs:

  1. dealing with a complex ETL pipeline with various kinds of APIs (file-based, REST, console, databases, ...)
  2. offers automated scheduling/orchestration on different execution environments (AWS, Azure, on-premise clusters, local machine, ...)
  3. has an option for "reactive" workflows ie workflows that can be triggered and executed instantaneously without unnecessary delay, are executed with highest priority and the same workflow can be started several times simultaneously具有“反应性”工作流程的选项,即可以立即触发和执行的工作流程,没有不必要的延迟,以最高优先级执行,并且可以同时启动相同的工作流程多次

The third requirement in particular seems to be tricky to satisfy. Its purpose is that a user should be able to send a query that activates a (computationally lightweight) workflow and get back a result immediately, instead of waiting seconds or even minutes, and multiple users might want to use the same workflow at the same time. This matters because the ETL workflows and the user ("reactive") workflows overlap substantially, and I intend to reuse parts of these workflows instead of maintaining two sets of workflows executed by different tools.

Apache Airflow appears to be the natural choice for requirements 1. and 2., but it does not seem to support the third requirement, since it starts execution in (lengthy) fixed time slots and does not allow for the simultaneous execution of several instances of the same DAG (workflow).

Are there any tools out there that support all these requirements, or do I have to use two different workflow management tools or even have to stick to a (Python) script for the user workflows?

You can trigger a DAG manually by using the CLI or the API. Have a look at this post: https://medium.com/@ntruong/airflow-externally-trigger-a-dag-when-a-condition-match-26cae67ecb1a
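For reference, here is a minimal sketch of triggering a run through Airflow's stable REST API (Airflow 2.x); the DAG id `reactive_query`, the host, and the credentials are placeholder assumptions, and the CLI equivalent would be `airflow dags trigger reactive_query --conf '{"user_query": "..."}'`:

```python
import requests

# Placeholder endpoint and DAG id; assumes an Airflow 2.x webserver with the
# basic-auth API backend enabled.
AIRFLOW_API = "http://localhost:8080/api/v1"
DAG_ID = "reactive_query"

response = requests.post(
    f"{AIRFLOW_API}/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "airflow"),  # illustrative credentials
    json={"conf": {"user_query": "parameters passed to this particular run"}},
)
response.raise_for_status()

# The response describes the newly created DAG run.
print(response.json()["dag_run_id"])
```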

You'll have to test whether you can execute multiple DAG runs at the same time.
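Whether runs may overlap is configured on the DAG itself. A minimal sketch of a trigger-only DAG that allows several concurrent runs might look like the following (the DAG id, task body, and the value of `max_active_runs` are illustrative assumptions):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# schedule_interval=None means the DAG only runs when triggered externally;
# max_active_runs caps how many runs of this DAG may be in flight at once.
with DAG(
    dag_id="reactive_query",          # placeholder name, matching the trigger example
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    max_active_runs=8,
    catchup=False,
) as dag:

    def answer_query(**context):
        # Hypothetical task body: read the per-run conf passed by the trigger.
        conf = context["dag_run"].conf or {}
        print(f"answering query: {conf.get('user_query')}")

    PythonOperator(task_id="answer_query", python_callable=answer_query)
```

Note that even with concurrent runs enabled, end-to-end latency still depends on how quickly the scheduler picks up the new run, so you should measure whether it is fast enough for your "reactive" use case.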
