
Apache Airflow - Slow to parse SQL queries on AWS MWAA

I'm trying to build a DAG on AWS MWAA that exports data from Postgres (RDS) to S3. In total it will export 385 tables, but MWAA runs into a problem while trying to parse all the queries in my task: the DAG gets stuck in running mode and never starts the task.

Basically, this process will:

  1. Load the table schema
  2. Rename some columns
  3. Export the data to S3

Function

def export_to_s3(dag, conn, db, pg_hook, export_date, s3_bucket, schemas):

    tasks = []
    run_queries = []
    
    for schema, features in schemas.items():
        t = features.get("tables")
        if t:
            tables = t
        else:
            tables = helper.get_tables(pg_hook, schema).table_name.tolist()

        is_full_export = features.get("full")

        for table in tables:
            columns = helper.get_table_schema(
                pg_hook, table, schema
            ).column_name.tolist()
            masked_columns = helper.masking_pii(columns, pii_columns=PII_COLS)
            masked_columns_str = ",\n".join(masked_columns)

            if is_full_export:
                statement = f'select {masked_columns_str} from {db}.{schema}."{table}"'
            else:
                statement = f'select {masked_columns_str} from {db}.{schema}."{table}" order by random() limit 10000'
            s3_bucket_key = export_date + "_" + schema + "_" + table + ".csv"
            sql_export = f"""
            SELECT * from aws_s3.query_export_to_s3(
                '{statement}',
                    aws_commons.create_s3_uri(
                        '{s3_bucket}',
                        '{s3_bucket_key}',
                        'ap-southeast-2'),
                        options := 'FORMAT csv, DELIMITER $$|$$'
            )""".strip()
            run_queries.append(sql_export)

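For reference, this is roughly what one generated statement looks like. The sketch below rebuilds it for a single table (the table, bucket, and column names are illustrative, not from the real DAG):

```python
# Illustrative reconstruction of one export statement produced by the
# loop above; values below are made up for demonstration.
def build_export_sql(db, schema, table, columns, s3_bucket, export_date,
                     full_export=False):
    cols = ",\n".join(columns)
    statement = f'select {cols} from {db}.{schema}."{table}"'
    if not full_export:
        # Sampled export: random 10k rows, matching the original DAG.
        statement += " order by random() limit 10000"
    key = f"{export_date}_{schema}_{table}.csv"
    return (
        "SELECT * from aws_s3.query_export_to_s3(\n"
        f"    '{statement}',\n"
        f"    aws_commons.create_s3_uri('{s3_bucket}', '{key}', 'ap-southeast-2'),\n"
        "    options := 'FORMAT csv, DELIMITER $$|$$')"
    )

print(build_export_sql("mydb", "public", "users", ["id", "email"],
                       "my-bucket", "2021-01-01"))
```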


def get_table_schema(pg_hook, table_name, table_schema):
    """Gets the schema details of a given table in a given schema."""
    query = """
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = '{0}'
      AND table_name = '{1}'
    ORDER BY ordinal_position
    """.format(table_schema, table_name)

    df_schema = pg_hook.get_pandas_df(query)
    return df_schema


def get_tables(pg_hook, schema):
    query = """
    SELECT table_name FROM information_schema.tables
    WHERE table_schema = '{}' AND table_type = 'BASE TABLE'
      AND table_name != '_sdc_rejected'
    """.format(schema)

    df_schema = pg_hook.get_pandas_df(query)
    return df_schema

Task

    task = PostgresOperator(
        sql=run_queries,
        postgres_conn_id=conn,
        task_id="export_to_s3",
        dag=dag,
        autocommit=True,
    )

    tasks.append(task)

    return tasks

Airflow list_dags output

DAGS
-------------------------------------------------------------------
mydag
-------------------------------------------------------------------
DagBag loading stats for /usr/local/airflow/dags
-------------------------------------------------------------------
Number of DAGs: 1
Total task number: 3
DagBag parsing time: 159.94030800000002
-----------------------------------------------------+--------------------+---------+----------
file                                                 | duration           | dag_num | task_num 
-----------------------------------------------------+--------------------+---------+----------
/mydag.py                                            | 159.05215199999998 |       1 |        3 
/ActivationPriorityCallList/CallList_Generator.py    | 0.878734           |       0 |        0 
/ActivationPriorityCallList/CallList_Preprocessor.py | 0.00744            |       0 |        0 
/ActivationPriorityCallList/CallList_Emailer.py      | 0.001154           |       0 |        0 
/airflow_helperfunctions.py                          | 0.000828           |       0 |        0 
-----------------------------------------------------+--------------------+---------+----------

Observation

If I enable only one table to be loaded in the task, it works well, but it fails if all tables are enabled. The behavior is the same if I run Airflow from Docker pointing at RDS.
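Worth noting: the ~159 s DagBag parsing time is consistent with `export_to_s3()` querying `information_schema` for every one of the 385 tables at parse time, i.e. every time the scheduler re-reads the file. A common pattern is to defer that discovery to execute time by wrapping it in a callable for a `PythonOperator`. A minimal sketch with illustrative names (`make_export_callable`, `fetch_tables`, `fetch_columns`, `run_sql` stand in for the real helpers and a hook's `.run()`; this is not the original DAG):

```python
# Hedged sketch: move the per-table metadata queries from DAG parse time
# into the task itself, so parsing only constructs the task object.
def make_export_callable(fetch_tables, fetch_columns, run_sql,
                         db, schema, s3_bucket, export_date):
    """Return a callable for PythonOperator(python_callable=...)."""
    def _export(**context):
        # Nothing below runs until the task actually executes.
        for table in fetch_tables(schema):
            cols = ", ".join(fetch_columns(table, schema))
            statement = f'select {cols} from {db}.{schema}."{table}"'
            key = f"{export_date}_{schema}_{table}.csv"
            run_sql(
                "SELECT * from aws_s3.query_export_to_s3("
                f"'{statement}', "
                f"aws_commons.create_s3_uri('{s3_bucket}', '{key}', 'ap-southeast-2'), "
                "options := 'FORMAT csv, DELIMITER $$|$$')"
            )
    return _export
```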


The issue was solved when I changed these values on MWAA:

  • webserver.web_server_master_timeout
  • webserver.web_server_worker_timeout

The default value is 30; I changed it to 480.
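For completeness, these overrides can be set in the MWAA console under Airflow configuration options, or via the AWS CLI. A sketch of the CLI form (the environment name is a placeholder, and this replaces the environment's existing option map, so include any other options you rely on):

```shell
# Hedged example: raise both webserver timeouts to 480 s on an existing
# MWAA environment. "my-mwaa-env" is a placeholder name.
aws mwaa update-environment \
  --name my-mwaa-env \
  --airflow-configuration-options \
    webserver.web_server_master_timeout=480,webserver.web_server_worker_timeout=480
```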

Link to the documentation.

