
Loading parquet file from S3 to AWS RDS taking extremely long time using AWS Glue ETL

I am writing a Glue ETL script to read a parquet file (about 2 GB) from S3 and upload it to RDS. However, the job runs for more than 14 hours. Is there any way to make it faster?

I have read that custom JDBC drivers are very slow and that some transforms cannot be parallelized, but I am not sure how accurate that is.

Here is the redacted sample code (note: this script was created and edited with Glue Studio):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame

import boto3
import re

# def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
#     for alias, frame in mapping.items():
#         frame.toDF().createOrReplaceTempView(alias)
#     result = spark.sql(query)
#     return DynamicFrame.fromDF(result, glueContext, transformation_ctx)


args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node farmlands_raw.fact_inventory
# List the date-stamped prefixes under the source path to find the
# newest partition.
s3_client = boto3.client("s3")
result = s3_client.list_objects(
    Bucket="XXXX",
    Prefix="XXXX",
    Delimiter="/",
)
dates = []
for o in result.get("CommonPrefixes", []):
    prefix = o.get("Prefix")
    # Each prefix is expected to contain an 8-digit date (YYYYMMDD)
    dates.append(int(re.search(r"\d{8}", prefix).group()))
latest_date = max(dates)

print(latest_date)

node1661227531364 = glueContext.create_dynamic_frame.from_catalog(
    database="XXXX",
    table_name="XXXXX",
    push_down_predicate=f"upload_date = {latest_date}",
    transformation_ctx="XXXXX",
)


# # Script generated for node SQL
# SqlQuery9 = """
# select * from fact_inventory limit 10
# """
# SQL_node1661227556612 = sparkSqlQuery(
#     glueContext,
#     query=SqlQuery9,
#     mapping={"fact_inventory": node1661227531364},
#     transformation_ctx="SQL_node1661227556612",
# )

# Script generated for node Apply Mapping
ApplyMapping_node1661227598286 = ApplyMapping.apply(
    frame=node1661227531364,
    mappings=[
        .... redacted 
    ],
    transformation_ctx="ApplyMapping_node1661227598286",
)

# Script generated for node admin_ 579445444291_rds_connector
# admin_579445444291_rds_connector_node1661227607272 = (
#     glueContext.write_dynamic_frame.from_options(
#         frame=ApplyMapping_node1661227598286,
#         connection_type="custom.jdbc",
#         connection_options={
#             "tableName": "fact_inventory_staging_temp",
#             "dbTable": "fact_inventory_staging_temp",
#             "connectionName": "qu_farmlands_raw_merged",
#         },
#         transformation_ctx="admin_579445444291_rds_connector_node1661227607272",
#     )
# )

glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node1661227598286,
    connection_type="custom.jdbc",
    connection_options={
        "tableName": "xxxx",
        "dbTable": "xxxx",
        "connectionName": "xxxx",
    },
    transformation_ctx="xxxx",
)

job.commit()
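As a point of comparison for the JDBC-speed concern mentioned above: Spark's built-in JDBC sink (as opposed to the `custom.jdbc` connector used in this script) exposes documented write-tuning options such as `batchsize` and `numPartitions`. A sketch only; the URL, table, and values below are placeholder assumptions, not from the original job:

```python
# Hypothetical tuning options for Spark's built-in JDBC data source.
# Option names are from the Spark JDBC documentation; the URL and
# table name are placeholders.
jdbc_write_options = {
    "url": "jdbc:postgresql://my-rds-host:5432/mydb",  # placeholder
    "dbtable": "fact_inventory_staging_temp",          # placeholder
    "batchsize": "10000",      # rows per INSERT batch (Spark default: 1000)
    "numPartitions": "8",      # parallel JDBC connections on write
}

# In a Glue job, one could convert the DynamicFrame and write via
# Spark instead of the custom connector, roughly:
# (ApplyMapping_node1661227598286.toDF()
#     .write.format("jdbc")
#     .options(**jdbc_write_options)
#     .option("user", "...")
#     .option("password", "...")
#     .mode("append")
#     .save())
```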

You can try using this option when creating the dynamic frame: `additional_options = {"useS3ListImplementation": True}`

Regards,

