
Spark errors when writing to Synapse DWH pool

I am trying to write a dataframe in either append/overwrite mode into a Synapse table using the ("com.databricks.spark.sqldw") connector. The official docs don't mention much about the ACID properties of this write operation. My question is: if the write operation fails in the middle of the write, would the actions performed previously be rolled back?

One thing the docs do mention is that there are two classes of exception that could be thrown during this operation: SqlDWConnectorException and SqlDWSideException. My logic is that if the write operation is ACID compliant, then we do not need to do anything, but if not, then we plan to encapsulate this operation in a try-catch block and look for other options (maybe retry, or timeout).
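For reference, a minimal sketch of such a try/retry wrapper in PySpark, assuming a Databricks environment where `spark` and `df` already exist; the server, storage account, table name, and retry policy below are hypothetical placeholders, not values from the docs:

```python
import time

MAX_RETRIES = 3

def write_to_synapse(df):
    (df.write
        .format("com.databricks.spark.sqldw")
        .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydw")  # hypothetical server/database
        .option("tempDir", "abfss://tempdata@mystorage.dfs.core.windows.net/tmp")            # hypothetical staging location
        .option("forwardSparkAzureStorageCredentials", "true")
        .option("dbTable", "dbo.my_table")                                                   # hypothetical target table
        .mode("append")
        .save())

for attempt in range(1, MAX_RETRIES + 1):
    try:
        write_to_synapse(df)
        break
    except Exception:  # SqlDWConnectorException / SqlDWSideException surface as Py4J errors in PySpark
        if attempt == MAX_RETRIES:
            raise
        time.sleep(30 * attempt)  # simple linear back-off before retrying
```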

It has guaranteed ACID transaction behavior.

Refer: What is Delta Lake, where it states:

Azure Synapse Analytics is compatible with Linux Foundation Delta Lake. Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads. This is fully managed using Apache Spark APIs available in Azure Synapse.

As a good practice you should write your code to be re-runnable, e.g. delete potentially duplicate records. Imagine you are re-running a file for a failed day, or someone wants to reprocess a certain period. However, SQL pools do implement ACID through transaction isolation levels:

Use transactions in a SQL pool in Azure Synapse

SQL pool implements ACID transactions. The isolation level of the transactional support defaults to READ UNCOMMITTED. You can change it to READ COMMITTED SNAPSHOT ISOLATION by turning ON the READ_COMMITTED_SNAPSHOT database option for a user SQL pool when connected to the master database.
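For illustration, a hedged sketch of issuing that option change from Python with pyodbc, following the quoted guidance to connect to the master database; the server, pool, and credential values are hypothetical placeholders:

```python
import pyodbc

# Connect to the *master* database of the logical server; per the quoted docs,
# the READ_COMMITTED_SNAPSHOT option is set from there for a user SQL pool.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;"   # hypothetical server
    "DATABASE=master;"
    "UID=sqladmin;PWD=<password>",            # hypothetical credentials
    autocommit=True,                          # ALTER DATABASE cannot run inside a user transaction
)
conn.cursor().execute("ALTER DATABASE mydw SET READ_COMMITTED_SNAPSHOT ON")  # mydw = hypothetical pool name
conn.close()
```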

You should bear in mind that the default transaction isolation level for dedicated SQL pools is READ UNCOMMITTED, which does allow dirty reads. So the way I think about it is: ACID (Atomic, Consistent, Isolated, Durable) is a standard, and each provider implements the standard to different degrees through transaction isolation levels. Each transaction isolation level can meet ACID strongly or weakly. Here is my summary for READ UNCOMMITTED:

  • A - you should reasonably expect your transaction to be atomic, but you should (IMHO) write your code to be re-runnable
  • C - you should reasonably expect your transaction to be consistent, but bear in mind dedicated SQL pools do not support foreign keys, and the NOT ENFORCED keyword is applied to unique indexes on creation
  • I - READ UNCOMMITTED does not meet the 'I' (Isolated) criterion of ACID, allowing dirty reads (uncommitted data), but the gain is concurrency. You can change the default to READ COMMITTED SNAPSHOT ISOLATION as described above, but you would need a good reason to do so, and you should conduct extensive tests on your application given the impacts on behaviour, performance, concurrency etc.
  • D - you should reasonably expect your transaction to be durable

So the answer to your question is: depending on your transaction isolation level (bearing in mind the default is READ UNCOMMITTED in a dedicated SQL pool), each transaction meets ACID to a degree; most notably, Isolation (I) is not fully met. You have the opportunity to change this by altering the default transaction isolation level, at the cost of reduced concurrency and a now-obligatory regression test. I think you are most interested in Atomicity, and my advice stands: make sure your code is re-runnable anyway.
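As one illustration of a re-runnable pattern, the sketch below deletes any rows left by an earlier run of the same load before appending, using the connector's preActions option; the table, the load_date column, and the connection values are hypothetical assumptions, not from the question:

```python
load_date = "2024-01-15"  # hypothetical reload key for the day being (re)processed

(df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydw")
    .option("tempDir", "abfss://tempdata@mystorage.dfs.core.windows.net/tmp")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.my_table")
    # Delete rows left behind by any earlier (possibly failed) run of the same
    # day before the new rows land, so re-running the job cannot create duplicates.
    .option("preActions", f"DELETE FROM dbo.my_table WHERE load_date = '{load_date}'")
    .mode("append")
    .save())
```

Because the delete and the append are keyed on the same value, re-running the job for that day converges to the same final state.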

You tend to see the 'higher' transaction isolation levels (e.g. SERIALIZABLE) in OLTP systems rather than MPP systems like Synapse, the cost being concurrency. You want your bank withdrawal to work, right?
