PySpark: Optimize read/load from Delta using selected columns or partitions
I am trying to load data from Delta into a PySpark DataFrame.
path_to_data = 's3://mybucket/daily_data/'
df = spark.read.format("delta").load(path_to_data)
The underlying data is partitioned by date as follows:
s3://mybucket/daily_data/
dt=2020-06-12
dt=2020-06-13
...
dt=2020-06-22
Is there a way to optimize the read into a DataFrame, given that only a certain date range and a subset of the columns are needed?
The approach I have tried so far is:
df.registerTempTable("my_table")
new_df = spark.sql("select col1,col2 from my_table where dt_col > '2020-06-20' ")
# dt_col is a timestamp column in the DataFrame.
In the above statement, does Spark need to load the whole dataset, filter it by the date range, and only then select the needed columns? Is there any optimization that can be done in the PySpark read to load less data, given that it is already partitioned?
Something along the lines of:
df = spark.read.format("delta").load(path_to_data,cols_to_read=['col1','col2'])
or
df = spark.read.format("delta").load(path_to_data,partitions=[...])
In your case, no extra step is needed. Spark takes care of the optimization. Since the dataset is already partitioned by the column dt, when you query it with the partition column dt as the filter condition, Spark loads only the subset of the source dataset that matches the filter condition, in your case dt > '2020-06-20'.
Internally, Spark performs this optimization through partition pruning.
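The pruning described above can be sketched as a small helper. This is only an illustration: read_pruned and its parameter names are hypothetical, not part of Spark or Delta, and it assumes the filter is written against the partition column dt rather than the timestamp column dt_col. Nothing is read until an action runs, so the filter and the column selection are both available to the optimizer.

```python
def read_pruned(spark, path, columns, partition_predicate):
    """Read a Delta table, keeping only selected partitions and columns.

    `spark` is an existing SparkSession. The function and argument names
    are illustrative assumptions, not a Spark or Delta API.
    """
    df = spark.read.format("delta").load(path)
    # A filter on the partition column lets Spark skip whole dt=... directories.
    df = df.where(partition_predicate)
    # Selecting only the needed columns lets the columnar reader skip the rest.
    return df.select(*columns)


# Usage, assuming `spark` and `path_to_data` from the question:
# new_df = read_pruned(spark, path_to_data, ["col1", "col2"], "dt > '2020-06-20'")
# new_df.explain()  # inspect the physical plan for the partition filter on dt
```

Calling explain() on the result is a reasonable way to check whether the dt condition was actually applied as a partition filter instead of a row-level filter.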
To do this without SQL:
from pyspark.sql import functions as F
df = spark.read.format("delta").load(path_to_data).filter(F.col("dt_col") > F.lit('2020-06-20'))
For this example, though, you may need to do some extra work to compare the dates correctly.
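One way to reduce the risk in the date comparison is to validate the bounds in plain Python before building the filter. The sketch below assumes the dt partition values are zero-padded ISO dates (YYYY-MM-DD), as in the directory listing in the question; such strings sort in chronological order, so a string comparison behaves like a date comparison. The helper name date_range_predicate is hypothetical. How Spark coerces a string literal against a date or timestamp column can depend on the Spark version, so an explicit cast may be the safer choice for dt_col.

```python
from datetime import date


def date_range_predicate(column, start, end=None):
    """Build a SQL filter string such as "dt > '2020-06-20'".

    `start` is exclusive and `end`, when given, is inclusive. Both must be
    ISO dates; this helper is an illustrative assumption, not a Spark API.
    """
    # fromisoformat raises ValueError on malformed input such as '20/06/2020'.
    start_day = date.fromisoformat(start)
    predicate = f"{column} > '{start_day.isoformat()}'"
    if end is not None:
        end_day = date.fromisoformat(end)
        if end_day <= start_day:
            raise ValueError("end must be after start")
        predicate += f" AND {column} <= '{end_day.isoformat()}'"
    return predicate


# Usage, assuming `spark` and `path_to_data` from the question:
# df = spark.read.format("delta").load(path_to_data)
# df = df.where(date_range_predicate("dt", "2020-06-20")).select("col1", "col2")
```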