
What is the best practice to load a delta table specific partition in databricks?

I would like to know the best way to load a specific partition of a Delta table. Does option 2 load the whole table before filtering?

Option 1:

df = spark.read.format("delta").option('basePath','/mnt/raw/mytable/')\
   .load('/mnt/raw/mytable/ingestdate=20210703')

(Is the basePath option needed here?)

Option 2:

from pyspark.sql.functions import col

df = spark.read.format("delta").load('/mnt/raw/mytable/')
df = df.filter(col('ingestdate') == '20210703')

Many thanks in advance!

In the second option, Spark loads only the partitions that match the filter condition: internally, Spark performs partition pruning and reads only the relevant data from the source table.

Whereas in the first option, you directly instruct Spark to load only the partition path you specify.

So in both cases, you end up loading only the data for the respective partitions.

If your table is partitioned and you want to read just one partition, you can do it using where:

val partition = "year = '2019'"

val df = spark.read
  .format("delta")
  .load(path)
  .where(partition)

