
How to insert into Delta table in parallel

I have a process that, in short, runs 100+ copies of the same Databricks notebook in parallel on a pretty powerful cluster. At the end of its run, each notebook writes roughly 100 rows of data to the same Delta Lake table stored in an Azure Gen1 Data Lake. I am seeing extremely long insert times into Delta, and my only explanation is that Delta is doing some sort of locking of the table while an insert occurs and then freeing it up once a single notebook finishes. Yet https://docs.databricks.com/delta/concurrency-control.html implies that there are no insert conflicts and that multiple writers across multiple clusters can insert data simultaneously.

These inserts of 100 rows per notebook, across the 100+ notebooks, take over 3 hours. The current code causing the bottleneck is:

df.write.format("delta").mode("append").save("<path_>")

Currently there are no partitions on this table, which could be a possible fix, but before going down that route: is there something I am missing about how to get conflict-free inserts in parallel?

You have to choose between two isolation levels for your table, and the weaker one is the default, so there is no getting away from isolation levels: https://docs.databricks.com/delta/optimizations/isolation-level.html
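For reference, the isolation level is just a table property, so you can inspect and change it with SQL. A minimal sketch, assuming a path-based table; the path below is a placeholder I made up, not a value from your question:

# Placeholder path for illustration only
table_path = "/mnt/datalake/my_delta_table"

# Set the weaker (default) level explicitly, or switch to 'Serializable'
spark.sql(f"""
    ALTER TABLE delta.`{table_path}`
    SET TBLPROPERTIES ('delta.isolationLevel' = 'WriteSerializable')
""")

# Check what is currently set
spark.sql(f"SHOW TBLPROPERTIES delta.`{table_path}`").show(truncate=False)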

Delta Lake has OCC (Optimistic Concurrency Control), which means that the data you want to write to your table is validated against all of the data that the other 99 processes want to write. This means that 100*100 = 10,000 validations are being made. https://en.wikipedia.org/wiki/Optimistic_concurrency_control
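Because of OCC, a commit can still fail under heavy concurrency even though blind appends normally do not conflict, and the caller then has to retry. A minimal retry sketch; the backoff values and the check for "Concurrent" in the exception are my own assumptions, not something prescribed by Delta:

import time

def append_with_retry(df, path, max_retries=5):
    # Retry the append if Delta's optimistic commit fails because of a
    # concurrent writer; re-raise anything that is not a concurrency error.
    for attempt in range(max_retries):
        try:
            df.write.format("delta").mode("append").save(path)
            return
        except Exception as e:
            if "Concurrent" not in type(e).__name__ and "Concurrent" not in str(e):
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"Append to {path} still failing after {max_retries} retries")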

Please also bear in mind that your data processing architecture will only finish when the last of the 100 notebooks finishes. Maybe one or more of the 100 notebooks takes 3 hours to finish, and the insert is not to blame?

If long-running notebooks are not the problem, I would suggest you try to store the result data from each notebook in some sort of intermediate structure (e.g. one set of files per notebook) and then batch-insert the data from that structure (e.g. the files) into the destination table, as in the sketch below.

The data processing will be parallel; the insert will not be parallel.
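A minimal sketch of that idea; the staging path, table path and notebook identifier below are placeholders I made up, not values from the question:

staging_root = "/mnt/datalake/staging/run_001"      # placeholder staging area
delta_table_path = "/mnt/datalake/my_delta_table"   # placeholder table path

# In each of the 100+ notebooks: write the ~100 result rows as plain Parquet
# into a per-notebook subfolder instead of appending to the Delta table.
notebook_id = "notebook_001"                        # placeholder identifier
df.write.mode("overwrite").parquet(f"{staging_root}/{notebook_id}")

# In one follow-up job, after every notebook has finished: read all of the
# staged files and commit them to the Delta table in a single append.
staged = spark.read.parquet(f"{staging_root}/*")
staged.write.format("delta").mode("append").save(delta_table_path)

This keeps the heavy computation parallel while funnelling the Delta commit through a single writer, so there is only one transaction to validate instead of 100.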
