
Delta Lake Table Storage Sorting

I have a Delta Lake table and am inserting data into it. The business has asked for the data to be sorted when it is stored in the table.

I sorted my DataFrame before creating the Delta table, as below:

df.sort()

and then created the Delta table as below:

df.write.format('delta').option('mergeSchema', 'true').save('deltalocation')

When I read this data back into a DataFrame, I see that it is still unsorted,

and I have to call df.sort again in order to display the sorted data.

As I understand it, the data cannot actually be stored in sorted order, and the user has to write a sorting query when extracting the data from the table.

I need to understand whether this is correct, and also how Delta Lake stores the data internally.

My understanding is that it partitions the data and doesn't care about the sort order. The data is spread across multiple partitions.

Can someone please clarify this in more detail and advise whether my understanding is correct?

Delta Lake itself does not enable sorting, because that would require every engine writing to the table to sort the data. To balance simplicity, speed of ingestion, and speed of query, Delta Lake does not require or enable sorting per se; i.e., your statement is correct:

"My understanding is that it partitions the data and doesn't care about the sort order. The data is spread across multiple partitions."

Note that Delta Lake includes data skipping and OPTIMIZE ZORDER. These allow you to skip files/data by using the column statistics and by clustering the data. While sorting can be helpful for a single column, Z-order provides better multi-column data clustering. More info is available in "Delta 2.0 - The Foundation of your Data Lakehouse is Open".
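To make the data-skipping idea concrete, here is a minimal pure-Python sketch. It is only a simulation under stated assumptions, not Delta Lake's implementation: the helper names file_stats and files_to_scan are hypothetical, and each "file" is just a list of ids with a recorded min and max. It compares how many files a point lookup would have to open when the rows are laid out randomly versus sorted.

```python
import random

def file_stats(rows, rows_per_file):
    """Split rows into fixed-size 'files' and record the (min, max) of each one."""
    files = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    return [(min(f), max(f)) for f in files]

def files_to_scan(stats, value):
    """Count the files whose [min, max] range could contain the value."""
    return sum(1 for lo, hi in stats if lo <= value <= hi)

random.seed(0)
ids = list(range(1000))
shuffled = ids[:]
random.shuffle(shuffled)

# Same rows, two layouts: written as they arrived vs. sorted before writing
unsorted_stats = file_stats(shuffled, rows_per_file=100)
sorted_stats = file_stats(sorted(shuffled), rows_per_file=100)

unsorted_hits = files_to_scan(unsorted_stats, 420)
sorted_hits = files_to_scan(sorted_stats, 420)
print("files to scan (unsorted layout):", unsorted_hits)
print("files to scan (sorted layout):", sorted_hits)
```

With the sorted layout the min/max ranges do not overlap, so only one file can match; with the random layout most ranges overlap and little can be skipped. Clustering the data narrows the per-file ranges in the same way.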

That said, how Delta Lake stores the data is often a product of what the writer itself is doing. If you were to specify a sort during the write phase, e.g.:

df_sorted = df.repartition("date").sortWithinPartitions("date", "id")
df_sorted.write.format("delta").partitionBy("date").save('deltalocation')

Then the data should be written sorted within each partition, and it should come back sorted within each partition when read.

In response to the question about the potential order, allow me to provide a simple example:

from pyspark.sql.functions import expr
data = spark.range(0, 100)
# Keep the DataFrame itself; show() returns None, so call it separately
df = data.withColumn("mod", expr("mod(id, 10)"))
df.show()

# Write unsorted table
df.write.format("delta").partitionBy("mod").save("/tmp/df")

# Sort within partitions
df_sorted = df.repartition("mod").sortWithinPartitions("mod", "id")

# Write sorted table
df_sorted.write.format("delta").partitionBy("mod").save("/tmp/df_sorted")

The two DataFrames have been saved as Delta tables at their respective /tmp/df and /tmp/df_sorted locations.

You can read the data as follows:

# Load data
spark.read.format("delta").load("/tmp/df").show()
spark.read.format("delta").load("/tmp/df").orderBy("mod").show()

spark.read.format("delta").load("/tmp/df_sorted").show()
spark.read.format("delta").load("/tmp/df_sorted").orderBy("mod").show()

For the unsorted table, here are the first 20 rows; as expected, the data is not sorted.

+---+---+
| id|mod|
+---+---+
| 63|  3|
| 73|  3|
| 83|  3|
| 93|  3|
|  3|  3|
| 13|  3|
| 23|  3|
| 33|  3|
| 43|  3|
| 53|  3|
| 88|  8|
| 98|  8|
| 28|  8|
| 38|  8|
| 48|  8|
| 58|  8|
|  8|  8|
| 18|  8|
| 68|  8|
| 78|  8|
+---+---+

But in the case of df_sorted:

+---+---+
| id|mod|
+---+---+
|  2|  2|
| 12|  2|
| 22|  2|
| 32|  2|
| 42|  2|
| 52|  2|
| 62|  2|
| 72|  2|
| 82|  2|
| 92|  2|
|  9|  9|
| 19|  9|
| 29|  9|
| 39|  9|
| 49|  9|
| 59|  9|
| 69|  9|
| 79|  9|
| 89|  9|
| 99|  9|
+---+---+

As noted, the data within each partition is sorted. The partitions themselves are not sorted, because different worker threads read different partitions, so there is no guarantee on the order of the partitions unless you explicitly specify a sort order.
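The behaviour described above can be mimicked without Spark. The following is a rough pure-Python sketch, assuming ten hypothetical partitions keyed by mod that are each sorted by id and that come back in an arbitrary order; the helper sorted_within_partitions is made up for illustration. It shows that per-partition order survives the read, while a global order only appears once you sort explicitly (the analogue of adding orderBy("mod", "id") to the read).

```python
import random

# Ten hypothetical partitions keyed by mod; ids are sorted inside each one
partitions = {m: [i for i in range(100) if i % 10 == m] for m in range(10)}

# Simulate a reader that returns whole partitions in an arbitrary order
random.seed(1)
order = list(partitions)
random.shuffle(order)
rows = [(i, m) for m in order for i in partitions[m]]

def sorted_within_partitions(rows):
    """Return True if id never decreases among rows that share the same mod."""
    last_seen = {}
    for i, m in rows:
        if m in last_seen and i < last_seen[m]:
            return False
        last_seen[m] = i
    return True

# Explicit global sort, analogous to orderBy("mod", "id") at read time
globally_sorted = sorted(rows, key=lambda r: (r[1], r[0]))

print("partition order as read:", order)
print("sorted within partitions:", sorted_within_partitions(rows))
print("first rows after explicit sort:", globally_sorted[:3])
```

The takeaway is the same as in the tables above: if consumers need a deterministic overall order, they still have to ask for it at query time.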
