What is the best practice for loading a specific partition of a Delta table in Databricks?
I would like to know the best way to load a specific partition of a Delta table. Does option 2 load the whole table before filtering?
df = spark.read.format("delta").option('basePath','/mnt/raw/mytable/')\
.load('/mnt/raw/mytable/ingestdate=20210703')
(Is the basePath option needed here?)
from pyspark.sql.functions import col

df = spark.read.format("delta").load('/mnt/raw/mytable/')
df = df.filter(col('ingestdate') == '20210703')
Many thanks in advance!
In the second option, Spark loads only the partitions that match the filter condition: internally, Spark performs partition pruning and reads only the relevant data from the source table.
In the first option, you are directly instructing Spark to load only the specified partition.
So in both cases, you end up loading only the data for the respective partitions.
If your table is partitioned and you want to read just one partition, you can do it using where:
val partition = "year = '2019'"
val df = spark.read
.format("delta")
.load(path)
.where(partition)