
Error writing a partitioned Delta Table from a multitasking job in Azure Databricks

I have a notebook that writes a Delta table with a statement similar to the following:

match = "current.country = updates.country and current.process_date = updates.process_date"

deltaTable = DeltaTable.forPath(spark, silver_path)
deltaTable.alias("current") \
    .merge(data.alias("updates"), match) \
    .whenMatchedUpdate(set=update_set, condition=condition) \
    .whenNotMatchedInsert(values=values_set) \
    .execute()

The multitask job has two tasks that are executed in parallel.

When executing the job, the following error is displayed:

ConcurrentAppendException: Files were added to partition [country=Panamá, process_date=2022-01-01 00:00:00] by a concurrent update. Please try the operation again.

In each task I send a different country (Panama, Ecuador) and the same date as parameters, so each execution should only write the information corresponding to the country it received. This Delta table is partitioned by the country and process_date fields. Any ideas what I'm doing wrong? How should I specify the partition to be affected when using the merge statement?

I would appreciate it if you could clarify how I should work with partitions in these cases, since this is new to me.

Update: I adjusted the condition to specify the country and process date, following what is indicated here (ConcurrentAppendException). Now I get the following error message:

ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again.

I can't think what could be causing the error. I'll keep investigating.

Error – ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again.

This exception is often thrown during concurrent DELETE, UPDATE, or MERGE operations. While the concurrent operations may be physically updating different partition directories, one of them may read the same partition that the other one concurrently updates, thus causing a conflict. You can avoid this by making the separation explicit in the operation condition.
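Following the disjoint-condition advice above, one way to make the separation explicit is to embed each task's partition values as literals in the merge condition, so Delta can prove that concurrent merges touch disjoint partitions. This is a sketch: the helper name is my own, and the column names (`country`, `process_date`) are taken from the question.

```python
# Hypothetical helper: pin the partition columns to the literal values a
# given task receives, so two concurrent MERGEs can be proven disjoint.
def build_match_condition(country: str, process_date: str) -> str:
    return (
        "current.country = updates.country "
        "AND current.process_date = updates.process_date "
        f"AND current.country = '{country}' "
        f"AND current.process_date = '{process_date}'"
    )

# One task would pass ('Panamá', ...), the other ('Ecuador', ...):
match = build_match_condition("Panamá", "2022-01-01 00:00:00")
```

The resulting string would then be passed as the merge condition in place of the original `match` variable.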

An update query is executed against the Delta Lake target table when an 'Update Strategy' transformation is used in the mapping. When multiple Update Strategy transformations target the same table, multiple UPDATE queries run in parallel and the target data becomes unpredictable. Because of this unpredictable data scenario for concurrent UPDATE queries, using more than one 'Update Strategy' transformation per 'Databricks Delta Lake Table' in a mapping is not supported. Redesign the mapping so that there is one 'Update Strategy' transformation per Delta Lake table.

Solution -

When running a mapping with one 'Update Strategy' transformation per Databricks Delta Lake table, execution completes successfully.

Refer - https://docs.delta.io/latest/concurrency-control.html#avoid-conflicts-using-partitioning-and-disjoint-command-conditions

Initially, the affected table only had a date field as its partition, so I repartitioned it by the country and date fields. The new partitioning created the country/date directories, but the old directories from the date-only partition remained and were not deleted.


Apparently these directories were causing the conflict when they were read concurrently. I created a new Delta table on another path with the correct partitions and then replaced the table at the original path. This allowed the old partition directories to be removed.
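To see why the leftover directories conflicted: Delta (like Hive) lays partitions out as `column=value` directories, so after repartitioning, the old date-only layout and the new country+date layout coexisted under the same table root. A minimal illustration of the two layouts (the helper is illustrative only, not a Delta API; the path `silver` is assumed):

```python
# Illustrative only: build Hive-style partition directory paths to show
# the two layouts that coexisted under the table root after repartitioning.
def partition_path(root: str, **cols: str) -> str:
    return "/".join([root] + [f"{k}={v}" for k, v in cols.items()])

old_layout = partition_path("silver", process_date="2022-01-01 00:00:00")
new_layout = partition_path("silver", country="Panamá",
                            process_date="2022-01-01 00:00:00")
# old_layout: 'silver/process_date=2022-01-01 00:00:00'
# new_layout: 'silver/country=Panamá/process_date=2022-01-01 00:00:00'
```

A merge scoped to a country still has to scan the date-only directories at the root (they are not covered by any `country=...` prefix), which is consistent with the "files were added to the root of the table" message.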


The only consequence of performing these actions was that I lost the change history of the table (time travel).
