AWS Glue Truncate Redshift Table

I have created a Glue job that copies data from S3 (a CSV file) to Redshift. It works and populates the desired table.

However, I need to purge the table during this process, as I am left with duplicate records after the process completes.

I'm looking for a way to add this purge step to the Glue process. Any advice would be appreciated.

Thanks.

You can alter the Glue script to perform a "preaction" before insertion, as explained here:

https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/

datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = datasource0,
    catalog_connection = "test_red",
    connection_options = {
        "preactions": "truncate table target_table;",
        "dbtable": "target_table",
        "database": "redshiftdb"
    },
    redshift_tmp_dir = 's3://s3path',
    transformation_ctx = "datasink4")

For instance, for my script, which was mostly based on the defaults, I inserted a new DataSink before the last DataSink (I've replaced some of my details with {things}):

## @type: DataSink
## @args: [catalog_connection = "redshift-data-live", connection_options = {"dbtable": "{DBTABLE}", "database": "{DBNAME}"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "redshift-data-live", connection_options = {"preactions":"truncate table {TABLENAME};","dbtable": "{SCHEMA.TABLENAME}", "database": "{DB}"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
## @type: DataSink
## @args: [catalog_connection = "redshift-data-live", connection_options = {"dbtable": "{SCHEMA.TABLENAME}", "database": "{DB}"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## @return: datasink5
## @inputs: [frame = datasink4]
datasink5 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasink4, catalog_connection = "redshift-data-live", connection_options = {"dbtable": "{SCHEMA.TABLENAME}", "database": "{DB}"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink5")
job.commit()

The link @frobinrobin provided is out of date. I tried many times and found that the preactions statements will be skipped even if you provide the wrong syntax, and I ended up with duplicated rows (the insert action did execute!).

Try this:

Just replace glueContext.write_dynamic_frame.from_jdbc_conf() from the link above with glueContext.write_dynamic_frame_from_jdbc_conf() and it will work!

At least this helped me out in my case (the AWS Glue job was just inserting data into Redshift without executing the truncate table preactions).
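A minimal sketch of that call, reusing the placeholder connection name, table, and database from the answer above (the truncate still runs as a preaction):

# Sketch only: the same write as above, expressed through
# glueContext.write_dynamic_frame_from_jdbc_conf(); values in {braces} are placeholders.
datasink4 = glueContext.write_dynamic_frame_from_jdbc_conf(
    frame = dropnullfields3,
    catalog_connection = "redshift-data-live",
    connection_options = {
        "preactions": "truncate table {SCHEMA.TABLENAME};",
        "dbtable": "{SCHEMA.TABLENAME}",
        "database": "{DB}"
    },
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink4")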

Did you have a look at Job Bookmarks in Glue? It's a feature for keeping the high-water mark and works with S3 only. I am not 100% sure, but it may require partitioning to be in place.
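For reference, a minimal sketch of how bookmarks are wired into a Glue script; the bookmark feature itself is enabled in the job's properties rather than in code, and the database/table names here are placeholders:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # loads any saved bookmark state for this job

# the transformation_ctx is the key Glue uses to remember what was already read
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "{DB}", table_name = "{TABLE}", transformation_ctx = "datasource0")

# ... transforms and the Redshift write go here ...

job.commit()  # persists the bookmark so the next run skips already-processed S3 objects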

You need to modify the auto-generated code provided by Glue. Connect to Redshift using a Spark JDBC connection and execute the purge query.
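One way to run that purge from inside the script is to borrow the JVM's JDBC DriverManager through py4j. This is only a sketch: it assumes the Redshift JDBC driver is available on the Glue classpath and that jdbc_url, db_user and db_password are supplied by you (for example via job arguments):

from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
# reach into the JVM that Spark is already running and open a JDBC connection
conn = sc._jvm.java.sql.DriverManager.getConnection(jdbc_url, db_user, db_password)
try:
    stmt = conn.createStatement()
    # purge the target before the Glue write repopulates it
    stmt.execute("truncate table {SCHEMA.TABLENAME};")
    stmt.close()
finally:
    conn.close()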

To spin up the Glue containers in the Redshift VPC, specify the connection in the Glue job to gain access to the Redshift cluster.

Hope this helps.

You can use the Spark/PySpark Databricks library to do an append after a truncate of the table (this gives better performance than an overwrite):

preactions = "TRUNCATE table <schema.table>" 
df.write\
  .format("com.databricks.spark.redshift")\
  .option("url", redshift_url)\
  .option("dbtable", redshift_table)\
  .option("user", user)\
  .option("password", readshift_password)\
  .option("aws_iam_role", redshift_copy_role)\
  .option("tempdir", args["TempDir"])\
  .option("preactions", preactions)\
  .mode("append")\
  .save()

You can take a look at the Databricks documentation here.
