
Delta Lake Table Storage Sorting

I have a Delta Lake table and am inserting data into it. The business has asked that the data be sorted when it is stored in the table.

I sorted my DataFrame before creating the Delta table, as below:

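# sort() returns a new DataFrame, so keep the result; "sort_col" is a placeholder column name
df = df.sort("sort_col")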

and then created the Delta table as below:

df.write.format('delta').option('mergeSchema', 'true').save('deltalocation')

When I read this data back into a DataFrame, I see that it is still unsorted,

and I have to call df.sort again in order to display the sorted data.

Per my understanding, the data cannot actually be stored in sorted order, and the user will have to write a sorting query when extracting the data from the table.

I need to understand whether this is correct, and also how Delta Lake stores the data internally.

My understanding is that it partitions the data and doesn't care about the sort order; the data is spread across multiple partitions.

Can someone please clarify this in more detail and advise whether my understanding is correct?

Delta Lake does not itself enforce a sort order, because doing so would require every engine that writes to the table to sort the data. To balance simplicity, speed of ingestion, and speed of query, Delta Lake neither requires nor enables sorting per se. That is, your statement is correct:

"My understanding is that it partitions the data and doesn't care about the sort order. data is spread across multiple partitions."

Note that Delta Lake includes data skipping and OPTIMIZE ZORDER. These let you skip files using column statistics and by clustering the data. While sorting can be helpful for a single column, Z-ordering provides better multi-column data clustering. More info is available in Delta 2.0 - The Foundation of your Data Lakehouse is Open.
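To illustrate why clustering helps data skipping, here is a minimal pure-Python sketch. It is not Delta Lake's implementation: it assumes a hypothetical table of 256 (x, y) rows split into 16 files, keeps only per-file min/max statistics, and counts how many files a point filter would have to open under a plain sort by x versus a Z-order on x and y. The helper names (z_value, chunk, file_stats, files_scanned) are made up for this example.

```python
def z_value(x, y, bits=4):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z


def chunk(rows, size):
    """Split rows into consecutive 'files' of `size` rows each."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def file_stats(files):
    """Per-file (min, max) for each column, standing in for column statistics."""
    return [
        {
            "x": (min(r[0] for r in f), max(r[0] for r in f)),
            "y": (min(r[1] for r in f), max(r[1] for r in f)),
        }
        for f in files
    ]


def files_scanned(stats, column, value):
    """Count files whose [min, max] range for `column` could contain `value`."""
    return sum(1 for s in stats if s[column][0] <= value <= s[column][1])


# Hypothetical data: every (x, y) pair on a 16 x 16 grid, 16 rows per file.
rows = [(x, y) for x in range(16) for y in range(16)]

linear = file_stats(chunk(sorted(rows), 16))  # sorted by x, then y
zorder = file_stats(chunk(sorted(rows, key=lambda r: z_value(r[0], r[1])), 16))

linear_result = (files_scanned(linear, "x", 5), files_scanned(linear, "y", 5))
zorder_result = (files_scanned(zorder, "x", 5), files_scanned(zorder, "y", 5))

print("sorted by x  (x = 5, y = 5):", linear_result)  # (1, 16)
print("z-ordered    (x = 5, y = 5):", zorder_result)  # (4, 4)
```

In this toy layout the plain sort is ideal for filters on x but useless for filters on y, while the Z-ordered layout lets both filters skip three quarters of the files.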

That said, how Delta Lake stores the data is often a product of what the writer itself is doing. If you were to specify a sort during the write phase, e.g.:

df_sorted = df.repartition("date").sortWithinPartitions("date", "id")
df_sorted.write.format("delta").partitionBy("date").save('deltalocation')

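Then the data should be written sorted within each partition, and when read back it should still be sorted within each partition. The order of the partitions relative to one another is not guaranteed.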


In response to the question about the potential order, allow me to provide a simple example:

from pyspark.sql.functions import expr
data = spark.range(0, 100)
df = data.withColumn("mod", expr("mod(id, 10)"))
df.show()  # show() returns None, so call it separately instead of assigning its result to df

# Write unsorted table
df.write.format("delta").partitionBy("mod").save("/tmp/df")

# Sort within partitions
df_sorted = df.repartition("mod").sortWithinPartitions("mod", "id")

# Write sorted table
df_sorted.write.format("delta").partitionBy("mod").save("/tmp/df_sorted")

The two DataFrames have now been saved as Delta tables at their respective locations, /tmp/df and /tmp/df_sorted.

You can read the data back as follows:

# Load data
spark.read.format("delta").load("/tmp/df").show()
spark.read.format("delta").load("/tmp/df").orderBy("mod").show()

spark.read.format("delta").load("/tmp/df_sorted").show()
spark.read.format("delta").load("/tmp/df_sorted").orderBy("mod").show()

For the unsorted table, here are the first 20 rows; as expected, the data is not sorted.

+---+---+
| id|mod|
+---+---+
| 63|  3|
| 73|  3|
| 83|  3|
| 93|  3|
|  3|  3|
| 13|  3|
| 23|  3|
| 33|  3|
| 43|  3|
| 53|  3|
| 88|  8|
| 98|  8|
| 28|  8|
| 38|  8|
| 48|  8|
| 58|  8|
|  8|  8|
| 18|  8|
| 68|  8|
| 78|  8|
+---+---+

But in the case of df_sorted:

+---+---+
| id|mod|
+---+---+
|  2|  2|
| 12|  2|
| 22|  2|
| 32|  2|
| 42|  2|
| 52|  2|
| 62|  2|
| 72|  2|
| 82|  2|
| 92|  2|
|  9|  9|
| 19|  9|
| 29|  9|
| 39|  9|
| 49|  9|
| 59|  9|
| 69|  9|
| 79|  9|
| 89|  9|
| 99|  9|
+---+---+

As noted, the data within each partition is sorted. The partitions themselves are not returned in sorted order, because different worker threads read different partitions. There is therefore no guarantee about the order of the partitions unless you explicitly specify a sort order when reading.
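The behaviour above can be mimicked without Spark. The following is a rough pure-Python sketch, not what Spark or Delta Lake actually executes: it assumes the reader visits partitions in an arbitrary order (modelled with a seeded shuffle) and shows that ids stay ascending inside each partition while a total order still needs an explicit sort, which is the role orderBy plays on read.

```python
import random

# 100 rows of (id, mod), mirroring the example table above.
rows = [(i, i % 10) for i in range(100)]

# Group rows by "mod" (the partition column) and sort each group by id,
# roughly what repartition("mod").sortWithinPartitions("mod", "id") produces.
partitions = {}
for row in rows:
    partitions.setdefault(row[1], []).append(row)
for part in partitions.values():
    part.sort()

# Assumption: partitions are visited in an arbitrary order on read.
visit_order = list(partitions)
random.Random(0).shuffle(visit_order)
read_back = [row for mod in visit_order for row in partitions[mod]]

# Ids are still ascending within every partition ...
within_partition_sorted = all(
    [i for i, m in read_back if m == mod] == sorted(i for i, m in read_back if m == mod)
    for mod in partitions
)
print(within_partition_sorted)  # True

# ... but a total order needs an explicit sort, like orderBy("mod", "id").
globally_sorted = sorted(read_back, key=lambda r: (r[1], r[0]))
print(globally_sorted[:3])  # [(0, 0), (10, 0), (20, 0)]
```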
