How to insert into Delta table in parallel

I have a process which, in short, runs 100+ instances of the same Databricks notebook in parallel on a pretty powerful cluster. At the end of its run, each notebook writes roughly 100 rows of data to the same Delta Lake table stored in Azure Data Lake Storage Gen1. I am seeing extremely long insert times into Delta, which I can only assume means Delta is doing some sort of locking of the table while an insert occurs and then freeing it up once a single notebook finishes. However, based on reading https://docs.databricks.com/delta/concurrency-control.html, there should be no insert conflicts, and multiple writers across multiple clusters should be able to insert data simultaneously.

This insertion of roughly 100 rows per notebook, across the 100+ notebooks, takes over 3 hours. The current code causing the bottleneck is:

df.write.format("delta").mode("append").save("<path>")

Currently there are no partitions on this table. Partitioning could be a possible fix, but before going down that route, is there something I am missing about how to get conflict-free inserts in parallel?

You have to choose between two isolation levels for your table (Serializable and WriteSerializable), and the weaker one, WriteSerializable, is the default, so there is no running away from isolation levels: https://docs.databricks.com/delta/optimizations/isolation-level.html
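For reference, the isolation level can be changed per table via a table property; a minimal sketch, assuming a path-based Delta table at a hypothetical location:

# Sketch: switch an existing Delta table to the stricter Serializable level.
# The default level is the weaker WriteSerializable.
# "/mnt/delta/results" is a hypothetical path; substitute your table's location.
spark.sql("""
    ALTER TABLE delta.`/mnt/delta/results`
    SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')
""")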

Delta Lake uses OCC (optimistic concurrency control; see https://en.wikipedia.org/wiki/Optimistic_concurrency_control). This means the data you want to write to the table is validated against all of the data that the other 99 processes want to write, so on the order of 100 × 100 = 10,000 validations take place.
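Because commits are optimistic, a writer whose commit fails validation gets a concurrent-modification exception and is expected to retry. A minimal retry sketch, assuming the delta-spark Python bindings (delta.exceptions) are available; note that plain appends rarely conflict under the default WriteSerializable level, but the same pattern applies to any concurrent write:

import time
from delta.exceptions import ConcurrentAppendException

def append_with_retry(df, path, max_retries=5):
    # Optimistically append df to the Delta table at path, retrying on conflict.
    for attempt in range(max_retries):
        try:
            df.write.format("delta").mode("append").save(path)
            return
        except ConcurrentAppendException:
            # Another writer's commit won the validation race; back off and retry.
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Append to {path} failed after {max_retries} retries")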

Please also bear in mind that your data processing pipeline only finishes when the last of the 100 notebooks finishes. Perhaps one or more of the 100 notebooks takes 3 hours to complete, and the insert is not to blame?

If long-running notebooks are not the cause, I would suggest you store the result data from each notebook in some sort of intermediate structure (e.g., 100 files, one per notebook) and then batch-insert the contents of that structure (e.g., the files) into the destination table in a single write, as sketched below.
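A minimal sketch of that idea, assuming each notebook can write to its own staging path (the staging paths and the notebook_id variable are hypothetical placeholders):

# In each of the 100 notebooks: stage the ~100 result rows as Parquet under
# a unique sub-path instead of appending to the Delta table directly.
df.write.mode("overwrite").parquet(f"/mnt/staging/results/{notebook_id}")

# In a single downstream job, once all notebooks have finished: read every
# staged file and fold the whole batch into the Delta table with one append,
# so only one Delta commit is made instead of 100 competing ones.
staged = spark.read.parquet("/mnt/staging/results/*")
staged.write.format("delta").mode("append").save("<path>")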

The data processing will be parallel; the insert will not be.
