
AWS Glue write only newest partitions parquet

I have a Glue database with two tables that hold the same data, just partitioned differently. I'm trying to write a job that runs nightly, reads from one table, and writes the new data out with the updated partitioning. I can do that with the following code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import lit

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# Read the source table from the Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "Database",
    table_name = "Table",
    transformation_ctx = "datasource0"
)

# Convert to a Spark DataFrame and write it back out, partitioned by the new keys
datasource0 = datasource0.toDF()

datasource0.write.partitionBy("Key1", "Key2").parquet(OutputFilePath)

But that picks up and writes the entire DataFrame. I only want to write the new partitions, so I found the following snippet on the AWS site:

glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_type = "s3",    
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")

But this also just rewrites the whole DataFrame. How can I write only the newest partitions?

Maybe have a look at job bookmarks; they act like a checkpoint mechanism so that data processed in earlier runs is not reprocessed: https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
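As an illustration only, here is a minimal sketch of a bookmark-enabled job, assuming the job is created with --job-bookmark-option set to job-bookmark-enable and reusing the database/table names from the question (the output path is a made-up placeholder). The transformation_ctx values plus job.init/job.commit are what let Glue remember which data has already been read:

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # bookmark state is tracked per job run

# transformation_ctx is required for the bookmark to track what was read
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "Database",
    table_name = "Table",
    transformation_ctx = "datasource0"
)

glueContext.write_dynamic_frame.from_options(
    frame = datasource0,
    connection_type = "s3",
    connection_options = {
        "path": "s3://output-bucket/prefix",  # placeholder output path
        "partitionKeys": ["Key1", "Key2"]
    },
    format = "parquet",
    transformation_ctx = "datasink0"
)

job.commit()  # persists the bookmark so the next run only sees new data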

This can be done with the push_down_predicate parameter. The data happens to be partitioned by year, month, day, and hour, so I just subtract one day and use push_down_predicate as follows:

import datetime

# Yesterday's date, e.g. '2020-01-31'
timestamp = (datetime.datetime.now() - datetime.timedelta(days=1)).strftime('%Y-%m-%d')
s1 = timestamp.split('-')

# Only read yesterday's partition (partition_0/1/2 = year/month/day)
pdp = "partition_0 = " + s1[0] + " and partition_1 = " + s1[1] + " and partition_2 = " + s1[2]

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "mailfiles_standardized",
    table_name = "firehoseoutput",
    push_down_predicate = pdp
)

glueContext.write_dynamic_frame.from_options(
    frame = datasource0,
    connection_type = "s3",
    connection_options = {
        "path": Bucket,
        "partitionKeys": ["Key1", "Key2"]
    },
    format = "parquet"
)
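As a side note (an assumption, not part of the original answer): the pushdown predicate is evaluated as a Spark SQL expression against the catalog partition columns, so if those partition values are stored as strings (Firehose writes zero-padded values such as '01'), the values may need to be quoted, for example:

# Hypothetical variant with quoted partition values
pdp = "partition_0 = '{}' and partition_1 = '{}' and partition_2 = '{}'".format(s1[0], s1[1], s1[2])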

