
Apache Airflow - Slow to parse SQL queries on AWS MWAA

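I'm trying to build a DAG on AWS MWAA that exports data from Postgres (RDS) to S3, but it runs into trouble as soon as MWAA tries to parse all the queries in my task. In total it will export 385 tables, but the DAG gets stuck in the running state and never starts my task.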

Basically, this process will:

  1. Load the table schema
  2. Rename some columns
  3. Export the data to S3

Function

def export_to_s3(dag, conn, db, pg_hook, export_date, s3_bucket, schemas):

    tasks = []
    run_queries = []
    
    for schema, features in schemas.items():
        t = features.get("tables")
        if t:
            tables = t
        else:
            tables = helper.get_tables(pg_hook, schema).table_name.tolist()

        is_full_export = features.get("full")

        for table in tables:
            columns = helper.get_table_schema(
                pg_hook, table, schema
            ).column_name.tolist()
            masked_columns = helper.masking_pii(columns, pii_columns=PII_COLS)
            masked_columns_str = ",\n".join(masked_columns)

            if is_full_export:
                statement = f'select {masked_columns_str} from {db}.{schema}."{table}"'
            else:
                statement = f'select {masked_columns_str} from {db}.{schema}."{table}" order by random() limit 10000'
            s3_bucket_key = export_date + "_" + schema + "_" + table + ".csv"
            sql_export = f"""
            SELECT * from aws_s3.query_export_to_s3(
                '{statement}',
                    aws_commons.create_s3_uri(
                        '{s3_bucket}',
                        '{s3_bucket_key}',
                        'ap-southeast-2'),
                        options := 'FORMAT csv, DELIMITER $$|$$'
            )""".strip()
            run_queries.append(sql_export)



def get_table_schema(pg_hook, table_name, table_schema):
    """ Gets the schema details of a given table in a given schema."""
    query = """
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = '{0}'
      AND table_name = '{1}'
    order by ordinal_position
    """.format(table_schema, table_name)

    df_schema = pg_hook.get_pandas_df(query)
    return df_schema


def get_tables(pg_hook, schema):
    query = """
    select table_name from information_schema.tables
    where table_schema = '{}' and table_type = 'BASE TABLE' and table_name != '_sdc_rejected' """.format(schema)

    df_schema = pg_hook.get_pandas_df(query)
    return df_schema
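Since get_table_schema runs one query per table, 385 tables mean 385 round trips to RDS. One possible way to cut that down is to fetch the columns for a whole schema in a single query and group them in Python. The sketch below is only an illustration under that assumption: ALL_COLUMNS_QUERY and group_columns_by_table are hypothetical names, not part of the original helper module.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical single query that replaces one get_table_schema call per table.
# The %(schema)s placeholder is the psycopg2 named-parameter style.
ALL_COLUMNS_QUERY = """
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = %(schema)s
ORDER BY table_name, ordinal_position
"""


def group_columns_by_table(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (table_name, column_name) rows into {table_name: [column, ...]}.

    Column order within each table follows the order of the incoming rows.
    """
    columns: Dict[str, List[str]] = defaultdict(list)
    for table_name, column_name in rows:
        columns[table_name].append(column_name)
    return dict(columns)


# Possible usage with a PostgresHook (assumption, not shown in the question):
#   rows = pg_hook.get_records(ALL_COLUMNS_QUERY, parameters={"schema": schema})
#   columns_by_table = group_columns_by_table(rows)
```

This keeps one query per schema instead of one per table. Whether it is enough on its own depends on where the parse time is actually spent.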

Task

    task = PostgresOperator(
        sql=run_queries,
        postgres_conn_id=conn,
        task_id="export_to_s3",
        dag=dag,
        autocommit=True,
    )

    tasks.append(task)

    return tasks
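Because export_to_s3 is called while the DAG file is being imported, every schema lookup runs each time the file is parsed. A common way around that is to build and run the queries inside the task itself, so nothing touches the database at parse time. The following is a minimal sketch of that idea, not the original code: build_export_sql and export_schema are hypothetical names, the database access is passed in as plain callables so the logic can be tried without Airflow, and PII masking and the random sampling branch are left out for brevity.

```python
from typing import Callable, Iterable, List


def build_export_sql(statement: str, s3_bucket: str, s3_key: str,
                     region: str = "ap-southeast-2") -> str:
    """Wrap a SELECT statement in an aws_s3.query_export_to_s3 call."""
    # Double any single quotes so the statement is a valid SQL string literal.
    escaped = statement.replace("'", "''")
    return (
        "SELECT * from aws_s3.query_export_to_s3(\n"
        f"    '{escaped}',\n"
        f"    aws_commons.create_s3_uri('{s3_bucket}', '{s3_key}', '{region}'),\n"
        "    options := 'FORMAT csv, DELIMITER $$|$$'\n"
        ")"
    )


def export_schema(schema: str,
                  tables: Iterable[str],
                  fetch_columns: Callable[[str, str], List[str]],
                  run_sql: Callable[[str], None],
                  db: str,
                  export_date: str,
                  s3_bucket: str) -> int:
    """Look up columns and run one export per table; returns the table count.

    Intended to be called at task run time, not while the DAG file is parsed.
    """
    exported = 0
    for table in tables:
        columns = fetch_columns(table, schema)
        statement = f'select {", ".join(columns)} from {db}.{schema}."{table}"'
        s3_key = f"{export_date}_{schema}_{table}.csv"
        run_sql(build_export_sql(statement, s3_bucket, s3_key))
        exported += 1
    return exported


# Possible wiring (assumption): pass export_schema as the python_callable of a
# PythonOperator, with fetch_columns and run_sql backed by a PostgresHook, e.g.
# run_sql=lambda sql: pg_hook.run(sql, autocommit=True).
```

With this shape the DAG file only declares the task, and the 385 lookups happen once per run on a worker rather than on every parse.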

Airflow list_dags output

DAGS
-------------------------------------------------------------------
mydag
-------------------------------------------------------------------
DagBag loading stats for /usr/local/airflow/dags
-------------------------------------------------------------------
Number of DAGs: 1
Total task number: 3
DagBag parsing time: 159.94030800000002
-----------------------------------------------------+--------------------+---------+----------
file                                                 | duration           | dag_num | task_num 
-----------------------------------------------------+--------------------+---------+----------
/mydag.py                                            | 159.05215199999998 |       1 |        3 
/ActivationPriorityCallList/CallList_Generator.py    | 0.878734           |       0 |        0 
/ActivationPriorityCallList/CallList_Preprocessor.py | 0.00744            |       0 |        0 
/ActivationPriorityCallList/CallList_Emailer.py      | 0.001154           |       0 |        0 
/airflow_helperfunctions.py                          | 0.000828           |       0 |        0 
-----------------------------------------------------+--------------------+---------+----------
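As a rough sanity check on the numbers above, the parse time can be divided by the table count. This assumes the time is dominated by the per-table schema lookups, which the output alone does not prove; the 30-second figure is only used as an example limit.

```python
# Figures taken from the list_dags output and the question text.
PARSE_SECONDS = 159.05215199999998  # duration reported for /mydag.py
TABLE_COUNT = 385                   # number of tables to export


def seconds_per_table(parse_seconds: float, table_count: int) -> float:
    """Average parse time attributable to each table."""
    return parse_seconds / table_count


def tables_within(limit_seconds: float, per_table: float) -> int:
    """How many tables could be processed before a given time limit."""
    return int(limit_seconds // per_table)


per_table = seconds_per_table(PARSE_SECONDS, TABLE_COUNT)
print(f"about {per_table:.2f} s per table")
print(f"tables that fit in 30 s: {tables_within(30, per_table)}")
```

Under that assumption each table costs roughly 0.4 seconds, which would explain why a single table parses fine while all 385 do not.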

Observations

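If I allow only one table to be loaded in the task, it works fine, but if all tables are loaded, it fails. The behavior is the same when Airflow is run from Docker pointing at RDS.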

Screenshot of airflow list_dags: [image not available]

The problem was solved when I changed these values on MWAA:

  • webserver.web_server_master_timeout
  • webserver.web_server_worker_timeout

The default value was 30; I changed it to 480.
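On MWAA these settings are Airflow configuration options on the environment, which can be edited in the console. If you prefer to script it, something like the sketch below should work with boto3. The environment name is a placeholder, and timeout_options and apply_to_environment are hypothetical helper names.

```python
from typing import Dict


def timeout_options(seconds: int) -> Dict[str, str]:
    """Build the two webserver timeout overrides; MWAA expects string values."""
    return {
        "webserver.web_server_master_timeout": str(seconds),
        "webserver.web_server_worker_timeout": str(seconds),
    }


def apply_to_environment(environment_name: str, seconds: int = 480) -> None:
    """Apply the overrides to an existing MWAA environment.

    Assumes AWS credentials and region are already configured.
    """
    import boto3  # imported lazily so building the options needs no AWS SDK

    client = boto3.client("mwaa")
    client.update_environment(
        Name=environment_name,
        AirflowConfigurationOptions=timeout_options(seconds),
    )


# Example call; "my-mwaa-environment" is a placeholder name:
#   apply_to_environment("my-mwaa-environment", seconds=480)
```

Note that updating an environment takes a while to apply, so this is something to do once rather than on every deploy.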

Link to the documentation.
