How to write two Spark DataFrames to Redshift atomically?

I am using Databricks spark-redshift to write DataFrames to Redshift. I have two DataFrames that get appended to two separate tables, but I need this to happen atomically, i.e. if the second DataFrame fails to write to its table, the first write must be undone as well. Is there any way to do that?
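For context, here is a minimal sketch of the kind of writes involved, using spark-redshift's DataFrame writer. The DataFrames df1/df2, the JDBC URL, the table names, and the S3 tempdir are all placeholders, and S3 credentials are assumed to be configured already. Each save() is committed independently by the connector, so nothing ties the two loads together:

```python
jdbc_url = "jdbc:redshift://host:5439/mydb?user=me&password=secret"

# Two independent appends: if the second save() fails, the first
# append has already been committed and cannot be rolled back.
for df, table in [(df1, "table_one"), (df2, "table_two")]:
    (df.write
       .format("com.databricks.spark.redshift")
       .option("url", jdbc_url)
       .option("dbtable", table)
       .option("tempdir", "s3n://my-bucket/tmp/")
       .mode("append")
       .save())
```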

The solution is to have a staging table for each target table. To write the Spark results to the database:

  1. Clean the staging tables (DELETE FROM staging_table).
  2. Write the DataFrames to the staging tables using spark-redshift (this step is not atomic).
  3. Atomically copy from the staging tables to the target tables in a single transaction (for Python, use the redshift-sqlalchemy package); see the sketch after this list.
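
A minimal end-to-end sketch of the three steps, assuming PySpark plus SQLAlchemy with the Redshift dialect. Everything named here is a placeholder: the staging tables (staging_a, staging_b) are presumed to already exist with the same schemas as the targets (target_a, target_b), df1/df2 are the DataFrames from the question, and both connection strings need real credentials:

```python
from sqlalchemy import create_engine, text

JDBC_URL = "jdbc:redshift://host:5439/mydb?user=me&password=secret"
TEMP_DIR = "s3n://my-bucket/tmp/"  # S3 staging area used by spark-redshift
engine = create_engine("redshift+psycopg2://me:secret@host:5439/mydb")

def write_to_staging(df, staging_table):
    # Step 2 helper: bulk-load one DataFrame into its staging table.
    # This load commits on its own and is not atomic with the other one.
    (df.write
       .format("com.databricks.spark.redshift")
       .option("url", JDBC_URL)
       .option("dbtable", staging_table)
       .option("tempdir", TEMP_DIR)
       .mode("append")
       .save())

# Step 1: empty the staging tables so leftovers from a failed
# earlier run cannot leak into the targets.
with engine.begin() as conn:
    conn.execute(text("DELETE FROM staging_a"))
    conn.execute(text("DELETE FROM staging_b"))

# Step 2: the two non-atomic loads. If either one fails we stop
# here, and the target tables have not been touched.
write_to_staging(df1, "staging_a")
write_to_staging(df2, "staging_b")

# Step 3: copy both staging tables into the targets in ONE
# transaction. engine.begin() commits only if both INSERTs
# succeed and rolls back otherwise, so the targets change
# atomically or not at all.
with engine.begin() as conn:
    conn.execute(text("INSERT INTO target_a SELECT * FROM staging_a"))
    conn.execute(text("INSERT INTO target_b SELECT * FROM staging_b"))
```

A note on step 1: the answer says DELETE FROM rather than TRUNCATE. On Redshift, TRUNCATE implicitly commits the current transaction, so DELETE is the safer choice if you ever fold the cleanup into a larger transaction.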

Only one instance of the Spark application can be running at a time, i.e. you can't have two jobs writing to the staging tables at the same time, otherwise the resulting data won't be valid.
