
What is the best practice to load a delta table specific partition in databricks?

I would like to know the best way to load a specific partition of a Delta table. Does option 2 load the whole table before filtering?

Option 1:

df = spark.read.format("delta").option('basePath','/mnt/raw/mytable/')\
   .load('/mnt/raw/mytable/ingestdate=20210703')

(Is the basePath option needed here?)

Option 2:

from pyspark.sql.functions import col

df = spark.read.format("delta").load('/mnt/raw/mytable/')
df = df.filter(col('ingestdate') == '20210703')

Many thanks in advance!

In the second option, Spark loads only the partitions that match the filter condition: internally, Spark performs partition pruning and reads only the relevant data from the source table.

Whereas in the first option, you directly instruct Spark to load only the partition path you specify.

So in both cases, you end up loading only the data for the respective partitions.

If your table is partitioned and you want to read just one partition, you can do it using where:

val partition = "year = '2019'"

val df = spark.read
  .format("delta")
  .load(path)
  .where(partition)

