
How to execute a Spark SQL merge statement on an Iceberg table in Databricks?

I'm trying to get Apache Iceberg set up in our Databricks environment and running into an error when executing a MERGE statement in Spark SQL.

This code:

CREATE TABLE iceberg.db.table (id bigint, data string) USING iceberg;

INSERT INTO iceberg.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');

INSERT INTO iceberg.db.table SELECT id, data FROM (select * from iceberg.db.table) t WHERE length(data) = 1;

MERGE INTO iceberg.db.table t USING (SELECT * FROM iceberg.db.table) u ON t.id = u.id
WHEN NOT MATCHED THEN INSERT *

Generates this error:

Error in SQL statement: AnalysisException: MERGE destination only supports Delta sources.
Some(RelationV2[id#116L, data#117] iceberg.db.table

I believe the root of the issue is that MERGE is also a keyword for the Delta Lake SQL engine. From what I can tell, the problem stems from the order in which Spark tries to apply its analysis rules: MERGE triggers the Delta rule, which then throws an error because the target is not a Delta table. I'm able to read from, append to, and overwrite Iceberg tables without issue.
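For example, statements along these lines all succeed against the same table (a minimal illustration; the exact rows are arbitrary):

-- read
SELECT * FROM iceberg.db.table;

-- append
INSERT INTO iceberg.db.table VALUES (4, 'd');

-- overwrite
INSERT OVERWRITE iceberg.db.table SELECT id, data FROM iceberg.db.table;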

Primary question: How can I get Spark to recognize this as an Iceberg query and not a Delta one? Or is it possible to remove the Delta-related SQL rules altogether?

Environment

Spark version: 3.0.1

Databricks runtime version: 7.6

Iceberg configs

spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.type=hadoop
spark.sql.catalog.iceberg.warehouse=BLOB_STORAGE_CONTAINER
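
This configuration registers the Hadoop catalog under the name iceberg, which is why the queries above use three-part names (catalog.namespace.table). For example (illustrative; any namespace name works):

-- Create the namespace used by the example, then list its tables.
CREATE NAMESPACE IF NOT EXISTS iceberg.db;
SHOW TABLES IN iceberg.db;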

Stack trace:

com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: MERGE destination only supports Delta sources.
Some(RelationV2[id#116L, data#117] iceberg.db.table
);
    at com.databricks.sql.transaction.tahoe.DeltaErrors$.notADeltaSourceException(DeltaErrors.scala:343)
    at com.databricks.sql.transaction.tahoe.PreprocessTableMerge.apply(PreprocessTableMerge.scala:201)
    at com.databricks.sql.transaction.tahoe.PreprocessTableMergeEdge$$anonfun$apply$1.applyOrElse(PreprocessTableMergeEdge.scala:39)
    at com.databricks.sql.transaction.tahoe.PreprocessTableMergeEdge$$anonfun$apply$1.applyOrElse(PreprocessTableMergeEdge.scala:36)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:112)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:112)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:216)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:110)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:108)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
    at com.databricks.sql.transaction.tahoe.PreprocessTableMergeEdge.apply(PreprocessTableMergeEdge.scala:36)
    at com.databricks.sql.transaction.tahoe.PreprocessTableMergeEdge.apply(PreprocessTableMergeEdge.scala:29)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:152)

I believe the issue here is that Databricks always preempts other extensions added to the Spark session, so the Iceberg code path is never reached and only the Databricks extensions are ever used. I would ask your Databricks rep whether there is a way to have the Iceberg extensions applied first, or whether they can consider allowing other MERGE implementations.
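
In the meantime, since appends do go through the Iceberg path, an insert-only MERGE like the one above can be emulated with an anti-join. A sketch, assuming the new rows live in a view named updates (hypothetical name):

-- Insert only the rows whose id is not already in the target,
-- i.e. the effect of WHEN NOT MATCHED THEN INSERT *.
INSERT INTO iceberg.db.table
SELECT u.id, u.data
FROM updates u
WHERE NOT EXISTS (
  SELECT 1 FROM iceberg.db.table t WHERE t.id = u.id
);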

Only INSERT operations are allowed for non-Delta sources. DELETE and MERGE operations are not allowed.

Not exactly what you're looking for, but Databricks allows you to convert an Iceberg table in place (no data copying) into a Delta table:

https://docs.databricks.com/delta/delta-utility.html#convert-iceberg-to-delta

Requires DBR 10.4+.

-- Convert the Iceberg table in the path <path-to-table>.
CONVERT TO DELTA iceberg.`<path-to-table>`

-- Convert the Iceberg table in the path <path-to-table> without collecting statistics.
CONVERT TO DELTA iceberg.`<path-to-table>` NO STATISTICS

Then run MERGE on the Delta table.
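
For example, a sketch (the updates source is hypothetical; <path-to-table> as above):

-- After conversion the table resolves through the Delta rules,
-- so MERGE is accepted.
MERGE INTO delta.`<path-to-table>` t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;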

If Iceberg offers the same kind of in-place conversion (I am not sure), this would solve the original problem.
