PySpark: Optimize read/load from Delta using selected columns or partitions

I am trying to load data from Delta into a PySpark dataframe.

path_to_data = 's3://mybucket/daily_data/'
df = spark.read.format("delta").load(path_to_data)

Now the underlying data is partitioned by date as follows:

s3://mybucket/daily_data/
    dt=2020-06-12
    dt=2020-06-13
    ...
    dt=2020-06-22

Is there a way to optimize the read as a DataFrame, given that:

  1. Only a certain date range is needed
  2. Only a subset of columns is needed

The current way I have tried is:

df.registerTempTable("my_table")
new_df = spark.sql("select col1,col2 from my_table where dt_col > '2020-06-20' ")
# dt_col is a timestamp column in the dataframe.

In the above approach, does Spark need to load the whole dataset, filter the data based on the date range, and then filter down to the columns needed? Is there any optimization that can be done in the PySpark read to load only the relevant data, since it is already partitioned?

Something along the lines of:

df = spark.read.format("delta").load(path_to_data,cols_to_read=['col1','col2'])
or 
df = spark.read.format("delta").load(path_to_data,partitions=[...])

In your case, there is no extra step needed; the optimization is taken care of by Spark. Since you have already partitioned the dataset on column dt, when you query the dataset with the partition column dt in the filter condition, Spark loads only the subset of the data from the source that matches the filter condition, in your case dt > '2020-06-20'.

Internally, Spark performs this optimization through partition pruning.
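As a rough sketch (assuming the partition column dt stores ISO-formatted date strings, so lexicographic comparison matches chronological order), you can filter directly on dt and then inspect the physical plan to confirm that only the matching partitions are scanned:

from pyspark.sql import functions as F

path_to_data = 's3://mybucket/daily_data/'

# Filter on the partition column `dt` so Delta can prune partitions
# at read time instead of scanning every dt=... directory.
df = (
    spark.read.format("delta")
    .load(path_to_data)
    .filter(F.col("dt") > "2020-06-20")
)

# The physical plan should show the dt predicate applied at the scan,
# confirming that non-matching partitions are skipped.
df.explain(True)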

To do this without SQL:

from pyspark.sql import functions as F

df = spark.read.format("delta").load(path_to_data).filter(F.col("dt_col") > F.lit('2020-06-20'))

Though for this example you may have some work to do when comparing dates.
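As a follow-up sketch covering both parts of the question (assuming the column names col1 and col2 from the question), filtering on the partition column and selecting only the needed columns lets Spark prune both partitions and columns, since the Parquet files backing the Delta table are columnar:

from pyspark.sql import functions as F

# Filter on the partition column first (partition pruning), then
# project only the needed columns (column pruning in the columnar
# Parquet files backing the Delta table).
new_df = (
    spark.read.format("delta")
    .load(path_to_data)
    .filter(F.col("dt") > "2020-06-20")
    .select("col1", "col2")
)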
