
Loading parquet file from S3 to AWS RDS taking extremely long time using AWS Glue ETL

I am writing a Glue ETL script to read a parquet file (about 2 GB) from S3 and upload it to RDS. However, the job runs for more than 14 hours. Is there any way to make it faster?

I have read that custom JDBC drivers are very slow and that some transforms cannot be parallelized, but I am not sure how accurate that is.

Here is the redacted sample code (note: this script was created and edited with Glue Studio):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame

import boto3
import re

# def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
#     for alias, frame in mapping.items():
#         frame.toDF().createOrReplaceTempView(alias)
#     result = spark.sql(query)
#     return DynamicFrame.fromDF(result, glueContext, transformation_ctx)


args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node farmlands_raw.fact_inventory
# List the date-stamped prefixes under the source path to find the
# newest partition.
s3_client = boto3.client("s3")
result = s3_client.list_objects(
    Bucket="XXXX",
    Prefix="XXXX",
    Delimiter="/",
)
dates = []
for o in result.get("CommonPrefixes", []):
    prefix = o.get("Prefix")
    # Each prefix is expected to contain an 8-digit date (YYYYMMDD)
    dates.append(int(re.search(r"\d{8}", prefix).group()))
latest_date = max(dates)

print(latest_date)

node1661227531364 = glueContext.create_dynamic_frame.from_catalog(
    database="XXXX",
    table_name="XXXXX",
    push_down_predicate=f"upload_date = {latest_date}",
    transformation_ctx="XXXXX",
)


# # Script generated for node SQL
# SqlQuery9 = """
# select * from fact_inventory limit 10
# """
# SQL_node1661227556612 = sparkSqlQuery(
#     glueContext,
#     query=SqlQuery9,
#     mapping={"fact_inventory": node1661227531364},
#     transformation_ctx="SQL_node1661227556612",
# )

# Script generated for node Apply Mapping
ApplyMapping_node1661227598286 = ApplyMapping.apply(
    frame=node1661227531364,
    mappings=[
        .... redacted 
    ],
    transformation_ctx="ApplyMapping_node1661227598286",
)

# Script generated for node admin_ 579445444291_rds_connector
# admin_579445444291_rds_connector_node1661227607272 = (
#     glueContext.write_dynamic_frame.from_options(
#         frame=ApplyMapping_node1661227598286,
#         connection_type="custom.jdbc",
#         connection_options={
#             "tableName": "fact_inventory_staging_temp",
#             "dbTable": "fact_inventory_staging_temp",
#             "connectionName": "qu_farmlands_raw_merged",
#         },
#         transformation_ctx="admin_579445444291_rds_connector_node1661227607272",
#     )
# )

glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node1661227598286,
    connection_type="custom.jdbc",
    connection_options={
        "tableName": "xxxx",
        "dbTable": "xxxx",
        "connectionName": "xxxx",
    },
    transformation_ctx="xxxx",
)

job.commit()
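As a point of comparison for the JDBC-speed concern mentioned above: Spark's built-in JDBC sink (as opposed to the `custom.jdbc` connector used in this script) exposes documented write-tuning options such as `batchsize` and `numPartitions`. A sketch only; the URL, table, and values below are placeholder assumptions, not from the original job:

```python
# Hypothetical tuning options for Spark's built-in JDBC data source.
# Option names are from the Spark JDBC documentation; the URL and
# table name are placeholders.
jdbc_write_options = {
    "url": "jdbc:postgresql://my-rds-host:5432/mydb",  # placeholder
    "dbtable": "fact_inventory_staging_temp",          # placeholder
    "batchsize": "10000",      # rows per INSERT batch (Spark default: 1000)
    "numPartitions": "8",      # parallel JDBC connections on write
}

# In a Glue job, one could convert the DynamicFrame and write via
# Spark instead of the custom connector, roughly:
# (ApplyMapping_node1661227598286.toDF()
#     .write.format("jdbc")
#     .options(**jdbc_write_options)
#     .option("user", "...")
#     .option("password", "...")
#     .mode("append")
#     .save())
```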

You can try using this option when creating the dynamic frame: `additional_options = {"useS3ListImplementation": True}`

Regards,

