How to write two Spark DataFrames to Redshift atomically?
I am using Databricks spark-redshift to write DataFrames to Redshift. I have two DataFrames that get appended to two separate tables, but I need this to happen atomically, i.e. if the second DataFrame fails to write to its table, I need the first write to be undone as well. Is there any way to do that?
The solution is to have a staging table for each target table. To write Spark results to the database:

1. Clean up the staging tables (DELETE FROM staging_table).
2. Append the Spark results to the staging tables.
3. Copy from the staging tables to the target tables atomically, inside a single transaction (for Python, use the redshift-sqlalchemy package).

Caveat: only one instance of the Spark application can run at a time, i.e. you cannot have two jobs writing to the staging tables simultaneously, otherwise the resulting data will not be valid.
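A minimal sketch of the transactional copy step, assuming hypothetical table names (staging_a/target_a, staging_b/target_b) and a redshift-sqlalchemy connection; the SQL-building helper is plain Python, while the actual execution against a cluster is shown as comments:

```python
def copy_statements(table_pairs):
    """Build the SQL statements that move each staging table's rows into
    its target table and then clear the staging table. All statements are
    meant to run inside ONE transaction, so either both tables are updated
    or neither is. table_pairs: list of (staging_table, target_table)."""
    statements = []
    for staging, target in table_pairs:
        statements.append(f"INSERT INTO {target} SELECT * FROM {staging};")
        statements.append(f"DELETE FROM {staging};")
    return statements

# Hypothetical usage with redshift-sqlalchemy (requires a live cluster):
# from sqlalchemy import create_engine, text
# engine = create_engine("redshift+psycopg2://user:pass@host:5439/db")
# with engine.begin() as conn:  # engine.begin() wraps everything in one
#     for stmt in copy_statements(  # transaction: commit on success,
#             [("staging_a", "target_a"),  # rollback if any statement fails
#              ("staging_b", "target_b")]):
#         conn.execute(text(stmt))
```

If any INSERT or DELETE fails, the transaction rolls back and neither target table is changed, which gives you the all-or-nothing behavior across both DataFrames.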