
How to run Airflow tasks synchronously

I have an Airflow DAG comprising 2-3 steps:

  1. PythonOperator --> runs a query on AWS Athena and stores the generated file at a specific S3 path
  2. BashOperator --> increments an Airflow variable used for tracking
  3. BashOperator --> takes the output (response) of task 1 and runs some code on top of it

What happens is that the DAG completes within seconds, even while the Athena query step is still running.

I want to make sure that the further steps run only after the file is generated. Basically, I want this to be synchronous.
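The asker's code isn't shown, but the usual cause of this symptom is a task callable that only submits the query and returns immediately. A minimal sketch of that fire-and-forget pattern, assuming boto3 is used (the `start_athena_query` helper name and signature are hypothetical; `start_query_execution` is the real boto3 Athena call, and the client is passed in rather than created with `boto3.client("athena")` so the function is easy to test):

```python
def start_athena_query(client, sql, database, output_location):
    """Submit a query to Athena and return its execution id immediately.

    `client` is assumed to be a boto3 Athena client. start_query_execution
    returns as soon as the query is *submitted*, not when it finishes, so a
    task built on this alone succeeds while Athena is still running.
    """
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```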

You can set the tasks up as:

from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def athena_task():
    # Add your code
    return

t1 = PythonOperator(
    task_id='athena_task',
    python_callable=athena_task,
)

t2 = BashOperator(
    task_id='variable_task',
    bash_command='',  # replace with the relevant command
)

t3 = BashOperator(
    task_id='process_task',
    bash_command='',  # replace with the relevant command
)

t1 >> t2 >> t3

t2 will run only after t1 has completed successfully, and t3 will start only after t2 has completed successfully.

Note that Airflow has an AWSAthenaOperator, which might save you the trouble of writing the code yourself. The operator submits a query to Athena and saves the output to an S3 path set via the output_location parameter:

from airflow.providers.amazon.aws.operators.athena import AWSAthenaOperator

run_query = AWSAthenaOperator(
    task_id='athena_task',
    query='SELECT * FROM my_table',
    output_location='s3://some-bucket/some-path/',
    database='my_database',
)

Athena's query API is asynchronous: you start a query, get back an ID, and then you need to poll with the GetQueryExecution API call until the query has completed.

If you only start the query in the first task, there is no guarantee that it has completed by the time the next task runs. Only when GetQueryExecution returns a status of SUCCEEDED (or FAILED / CANCELLED) can you expect the output file to exist.
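That polling loop can be sketched as follows (the `wait_for_query` helper name is hypothetical; `get_query_execution` is the real boto3 Athena call, and the client is injected for testability):

```python
import time

def wait_for_query(client, query_execution_id, poll_interval=5.0):
    """Block until the Athena query reaches a terminal state.

    `client` is assumed to be a boto3 Athena client.
    Returns the final state: SUCCEEDED, FAILED, or CANCELLED.
    """
    terminal_states = {"SUCCEEDED", "FAILED", "CANCELLED"}
    while True:
        response = client.get_query_execution(
            QueryExecutionId=query_execution_id
        )
        state = response["QueryExecution"]["Status"]["State"]
        if state in terminal_states:
            return state
        # Query still QUEUED/RUNNING; wait before polling again.
        time.sleep(poll_interval)
```

Only when this returns SUCCEEDED should downstream tasks assume the output file exists; inside an Airflow task you would raise on FAILED / CANCELLED so the task is marked failed and the rest of the DAG does not run.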

As @Elad points out, AWSAthenaOperator does this for you, handles error cases, and more.

