
How to manually run Airflow DAG on a particular directory

I am evaluating whether Airflow is suitable for my needs (in bioinformatics). I am having some difficulty with the Airflow model. Specifically:

  • Where does the DAG file actually get executed? What is its context? How can I pass input data into the DAG definition file? (E.g., I want to create a task for each file in a directory.)
  • How do I execute a DAG on an ad hoc basis? How do I pass parameters for the DAG construction?

Here is an example of what I would like to execute. Say I just received some data as a directory containing 20 files available in some shared filesystem. I want to execute a DAG pipeline which runs a particular bash command on each of the 20 files, then combines some of the results and performs further processing. The DAG needs the path on the filesystem and also to list the files in the directory to construct a task for each one.

It's probably not necessary for me to pass metadata from one task to another (which I understand is possible through XCom), as long as I can dynamically construct the entire DAG upfront. But it's not clear to me how I can pass a path into the DAG construction.

Put another way, I'd like my DAG definition to include something like

dag = DAG(...)
for file in glob(input_path):
    t = BashOperator(..., dag=dag)

How do I get input_path passed in when I want to manually trigger a DAG?

I also don't really need the cron-style scheduling.

Regarding input_path, you can pass it to the DAG using Airflow Variables. Example of code used in the DAG file:

from airflow.models import Variable
input_path = Variable.get("INPUT_PATH")

Variables can be imported using the Airflow CLI or set manually through the UI.
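To see how the two pieces fit together: the DAG definition file is plain Python that Airflow executes every time it parses the file, so a loop that creates one task per file is ordinary code. Below is a sketch of that parse-time logic with the Airflow-specific parts stripped out (the file names and the `my_tool` command are hypothetical; in the real DAG file, `input_path` would come from `Variable.get("INPUT_PATH")` — set beforehand via the CLI or UI — and each dictionary entry would instead be a `BashOperator(task_id=..., bash_command=..., dag=dag)`):

```python
import os
import tempfile
from glob import glob

# Stand-in for the shared directory of incoming data; in a real DAG
# file this would be: input_path = Variable.get("INPUT_PATH")
input_path = tempfile.mkdtemp()
for name in ("a.fastq", "b.fastq", "c.fastq"):   # hypothetical inputs
    open(os.path.join(input_path, name), "w").close()

# One task per file, built at parse time. In Airflow, each entry
# would be a BashOperator(task_id=..., bash_command=..., dag=dag).
tasks = {}
for path in sorted(glob(os.path.join(input_path, "*.fastq"))):
    task_id = "process_" + os.path.basename(path).replace(".", "_")
    tasks[task_id] = "my_tool {}".format(path)   # my_tool is a placeholder

print(sorted(tasks))  # -> ['process_a_fastq', 'process_b_fastq', 'process_c_fastq']
```

Because the variable is re-read on every parse, changing INPUT_PATH reshapes the DAG the next time the scheduler picks up the file.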

You should use a subdag for this type of logic:

dag = DAG(...)
for file in glob(input_path):
    t = BashOperator(..., dag=dag)

SubDAGs are perfect for repeating patterns. Defining a function that returns a DAG object is a nice design pattern when using Airflow.
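That factory pattern can be illustrated with a plain-Python stand-in (`make_dag` and its return fields are hypothetical; in a real file the function would construct and return an `airflow.DAG`, which could then be reused for every directory of data that arrives, or attached as a subdag via a `SubDagOperator`):

```python
import os
import tempfile
from glob import glob

def make_dag(dag_id, input_path):
    """Stand-in for a factory that would build and return an airflow.DAG.

    Each call produces a fresh task list for the given directory, so the
    same function serves every new batch of input files.
    """
    task_ids = [
        "{}.process_{}".format(dag_id, os.path.basename(p))
        for p in sorted(glob(os.path.join(input_path, "*")))
    ]
    return {"dag_id": dag_id, "tasks": task_ids}

# Hypothetical usage: two sample files in a temporary directory.
d = tempfile.mkdtemp()
for name in ("s1", "s2"):
    open(os.path.join(d, name), "w").close()

dag = make_dag("run_2020_01", d)
print(dag["tasks"])  # -> ['run_2020_01.process_s1', 'run_2020_01.process_s2']
```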
