
Getting duplicates in the table when an ETL job is run twice. The ETL job fetches data from RDS to an S3 bucket

Original post: 2019-01-30 11:20:41. Tags: amazon-web-services, etl, upsert, aws-glue, staging-table

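When the ETL job runs, it executes correctly, but because the table has no timestamp, running the same ETL job again duplicates the data. How can I perform staging and solve this with an upsert? Any other approach is welcome too. The solutions I have found so far are adding a timestamp to the table or using a staging table; is there another way to get rid of this problem?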

2 Answers

You can use overwrite when writing the data to S3. It will replace the original data.
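As a rough illustration of what overwrite buys you, the sketch below uses the plain Spark `DataFrameWriter` rather than a Glue sink. It assumes the Glue `DynamicFrame` has been converted to a Spark DataFrame first (for example with `toDF()`), since as far as I know the Glue sink itself has no save-mode option. `OverwriteSketch` and `rowsAfterTwoRuns` are made-up names, and Parquet is used only because it ships with Spark; the format in your job may differ.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object OverwriteSketch {
  // Writes the same two rows to `path` twice with the given save mode,
  // then returns how many rows are stored at that path afterwards.
  def rowsAfterTwoRuns(spark: SparkSession, path: String, mode: SaveMode): Long = {
    import spark.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
    df.write.mode(mode).parquet(path) // first run of the job
    df.write.mode(mode).parquet(path) // same job run a second time
    spark.read.parquet(path).count()
  }
}
```

With `SaveMode.Overwrite` the second run replaces the first one's output, whereas `SaveMode.Append` keeps both copies. Note that overwrite replaces everything under the path, so it only fits jobs that extract the full table on each run.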

To prevent duplicates on S3, you need to load the data from the destination and filter out the existing records before saving:

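import com.amazonaws.services.glue.DynamicFrame
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.sql.functions.col

// Keep only the rows of the new extract whose id is not yet in the target.
// An explicit join condition keeps both id columns addressable by alias.
val deltaDf = newDataDf.alias("new")
  .join(existingDf.alias("existing"), col("new.id") === col("existing.id"), "left_outer")
  .where(col("existing.id").isNull)
  .select("new.*")

// Write only those new rows to S3.
glueContext.getSinkWithFormat(
    connectionType = "s3",
    options = JsonOptions(Map(
      "path" -> path
    )),
    transformationContext = "save_to_s3",
    format = "avro"
  ).writeDynamicFrame(DynamicFrame(deltaDf, glueContext))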

However, this approach does not overwrite updated records.
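If updated records should replace the stored ones, the merge has to behave like an upsert. The sketch below shows only that merge rule, using plain Scala collections instead of Glue or Spark. `Rec`, `UpsertSketch` and `upsert` are hypothetical names, and it assumes `id` uniquely identifies a record.

```scala
// Hypothetical record type, for illustration only.
final case class Rec(id: Int, value: String)

object UpsertSketch {
  // Merge incoming rows into the existing ones by id:
  // unseen ids are inserted, known ids are replaced by the incoming row.
  def upsert(existing: Seq[Rec], incoming: Seq[Rec]): Seq[Rec] = {
    val existingById = existing.map(r => r.id -> r).toMap
    val incomingById = incoming.map(r => r.id -> r).toMap
    // On a key collision, ++ keeps the value from the right-hand map.
    (existingById ++ incomingById).values.toSeq.sortBy(_.id)
  }
}
```

Because the result is keyed by `id`, feeding the same extract in a second time leaves the data unchanged. On DataFrames the same rule is usually expressed as an anti-join of the existing data against the incoming data, followed by a union with the incoming data.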

Another option is to save the updated records as well, together with an updated_at field that downstream consumers can use to pick the latest value.
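On the consumer side, picking the latest value could look roughly like this. It is a minimal sketch with plain Scala collections; `Versioned`, `LatestSketch` and `latestPerId` are hypothetical names, and it assumes every record carries an `id` and an `updatedAt` timestamp.

```scala
import java.time.Instant

// Hypothetical record type: several versions of the same id may be stored.
final case class Versioned(id: Int, value: String, updatedAt: Instant)

object LatestSketch {
  // For every id, keep only the version with the greatest updatedAt.
  def latestPerId(rows: Seq[Versioned]): Seq[Versioned] =
    rows
      .groupBy(_.id)
      .values
      .map(versions => versions.maxBy(_.updatedAt.toEpochMilli))
      .toSeq
      .sortBy(_.id)
}
```

In Spark, the same selection is commonly written with a window function partitioned by `id` and ordered by `updated_at` descending, keeping the first row of each partition.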

You can also consider dumping the dataset into a separate folder each time the job runs (i.e. every day you get a full dump of the data in data/dataset_date=<year-month-day>):

import org.apache.spark.sql.functions._

// Tag every row with the run date so that each run lands in its own partition folder.
val datedDf = sourceDf.withColumn("dataset_date", current_date())

glueContext.getSinkWithFormat(
    connectionType = "s3",
    options = JsonOptions(Map(
      "path" -> path,
      "partitionKeys" -> Array("dataset_date")
    )),
    transformationContext = "save_to_s3",
    format = "avro"
  ).writeDynamicFrame(DynamicFrame(datedDf, glueContext))
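With this layout, a consumer that wants only the newest dump has to work out which dataset_date folder is the most recent. Below is a minimal sketch of that step in plain Scala. `PartitionSketch` and `latestDate` are hypothetical names, and the listing of S3 keys is assumed to come from elsewhere.

```scala
import java.time.LocalDate
import scala.util.Try

object PartitionSketch {
  // Matches the partition folder name used above, e.g. dataset_date=2019-01-30
  private val DatePattern = """dataset_date=(\d{4}-\d{2}-\d{2})""".r

  // Returns the most recent dataset_date found in the given paths, if any.
  // Paths without a valid dataset_date=<year-month-day> segment are ignored.
  def latestDate(paths: Seq[String]): Option[LocalDate] = {
    val dates = paths.flatMap { p =>
      DatePattern.findFirstMatchIn(p).flatMap(m => Try(LocalDate.parse(m.group(1))).toOption)
    }
    if (dates.isEmpty) None else Some(dates.maxBy(_.toEpochDay))
  }
}
```

The consumer can then read only `data/dataset_date=<latest date>` and ignore older dumps, so rerunning the job on a later day never mixes old and new copies of the same record.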


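Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost, please credit this site's URL or the address of the original post. For any questions, contact: yoyou2525@163.com.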
