How to insert into Delta table in parallel

I have a process which, in short, runs 100+ instances of the same Databricks notebook in parallel on a pretty powerful cluster. At the end of its run, each notebook writes roughly 100 rows of data to the same Delta Lake table stored in Azure Data Lake Storage Gen1. I am seeing extremely long insert times into Delta, which I can only assume means Delta is doing some sort of locking of the table while an insert occurs and then freeing it up once a single notebook finishes. However, based on reading https://docs.databricks.com/delta/concurrency-control.html, there should be no insert conflicts, and multiple writers across multiple clusters should be able to insert data simultaneously.

This insertion of roughly 100 rows per notebook, across the 100+ notebooks, takes over 3 hours. The current code causing the bottleneck is:

df.write.format("delta").mode("append").save("<path>")

Currently there are no partitions on this table. Partitioning could be a possible fix, but before going down that route, is there something I am missing about how to get conflict-free inserts in parallel?

You have to choose between two isolation levels for your table (Serializable and WriteSerializable), and the weaker one, WriteSerializable, is the default, so there is no running away from isolation levels: https://docs.databricks.com/delta/optimizations/isolation-level.html
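For reference, the isolation level can be changed per table via a table property; a minimal sketch, assuming a path-based Delta table at a hypothetical location:

# Sketch: switch an existing Delta table to the stricter Serializable level.
# The default level is the weaker WriteSerializable.
# "/mnt/delta/results" is a hypothetical path; substitute your table's location.
spark.sql("""
    ALTER TABLE delta.`/mnt/delta/results`
    SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')
""")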

Delta Lake uses OCC (optimistic concurrency control; see https://en.wikipedia.org/wiki/Optimistic_concurrency_control). This means the data you want to write to the table is validated against all of the data that the other 99 processes want to write, so on the order of 100 × 100 = 10,000 validations take place.
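Because commits are optimistic, a writer whose commit fails validation gets a concurrent-modification exception and is expected to retry. A minimal retry sketch, assuming the delta-spark Python bindings (delta.exceptions) are available; note that plain appends rarely conflict under the default WriteSerializable level, but the same pattern applies to any concurrent write:

import time
from delta.exceptions import ConcurrentAppendException

def append_with_retry(df, path, max_retries=5):
    # Optimistically append df to the Delta table at path, retrying on conflict.
    for attempt in range(max_retries):
        try:
            df.write.format("delta").mode("append").save(path)
            return
        except ConcurrentAppendException:
            # Another writer's commit won the validation race; back off and retry.
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Append to {path} failed after {max_retries} retries")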

Please also bear in mind that your data processing pipeline only finishes when the last of the 100 notebooks finishes. Perhaps one or more of the 100 notebooks takes 3 hours to complete, and the insert is not to blame?

If long-running notebooks are not the cause, I would suggest you store the result data from each notebook in some sort of intermediate structure (e.g., 100 files, one per notebook) and then batch-insert the contents of that structure (e.g., the files) into the destination table in a single write, as sketched below.
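A minimal sketch of that idea, assuming each notebook can write to its own staging path (the staging paths and the notebook_id variable are hypothetical placeholders):

# In each of the 100 notebooks: stage the ~100 result rows as Parquet under
# a unique sub-path instead of appending to the Delta table directly.
df.write.mode("overwrite").parquet(f"/mnt/staging/results/{notebook_id}")

# In a single downstream job, once all notebooks have finished: read every
# staged file and fold the whole batch into the Delta table with one append,
# so only one Delta commit is made instead of 100 competing ones.
staged = spark.read.parquet("/mnt/staging/results/*")
staged.write.format("delta").mode("append").save("<path>")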

The data processing will be parallel; the insert will not be.
