Spark Predicate Push Down, Filtering and Partition Pruning for Azure Data Lake
I have been reading about Spark predicate pushdown and partition pruning to understand how much data gets read. I have the following questions related to this.
Suppose I have a dataset with columns (Year: Int, SchoolName: String, StudentId: Int, SubjectEnrolled: String), where the data on disk is partitioned by Year and SchoolName and stored in Parquet format in, say, Azure Data Lake Storage.
1) If I issue a read spark.read(container).filter(Year=2019, SchoolName="XYZ"):

2) If I issue a read spark.read(container).filter(StudentId = 43):
1) When you filter on the columns you partitioned on, Spark will skip the non-matching files completely, so they cost you no IO at all. If you look at your file structure, it is stored as something like:
parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet
parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet
parquet-folder/Year=2019/SchoolName=XYZ/...
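The idea behind partition pruning can be sketched in plain Python (this is an illustration of the mechanism, not Spark's actual implementation; the file names are hypothetical): a filter on the partition columns reduces to simple path matching, so non-matching files are never even opened.

```python
# Illustration only: a filter on partition columns becomes directory-path
# matching, so Spark never opens files outside the matching partitions.
# File names below are invented for the example.
files = [
    "parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet",
    "parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet",
    "parquet-folder/Year=2019/SchoolName=ABC/part1.parquet",
    "parquet-folder/Year=2018/SchoolName=XYZ/part1.parquet",
]

def prune(paths, **partition_filter):
    """Keep only paths whose Key=Value directory segments match the filter."""
    wanted = {f"{key}={value}" for key, value in partition_filter.items()}
    return [p for p in paths if wanted.issubset(p.split("/"))]

# Only the two files under Year=2019/SchoolName=XYZ survive pruning.
print(prune(files, Year=2019, SchoolName="XYZ"))
```

In real Spark you can confirm this happened by calling `.explain()` on the filtered DataFrame and checking the `PartitionFilters` entry of the file scan.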
2) When you filter on a column that isn't a partition column, Spark will scan every part file in every folder of that Parquet table. Only with pushdown filtering will Spark use the footer of each part file (where min, max and count statistics are stored) to determine whether your search value falls within that range. If yes, Spark reads the file fully. If not, Spark skips the whole file, so you at least avoid a full read of it.
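The footer-based skipping can likewise be sketched in plain Python. The statistics below are made up for the example; in a real Parquet file they live in the row-group metadata of each part file's footer.

```python
# Illustration only: min/max footer statistics decide whether a part file
# can be skipped for a filter like StudentId = 43. The stats are invented.
footers = {
    "part1.parquet": {"min": 1,  "max": 40},   # 43 outside [1, 40]   -> skip
    "part2.parquet": {"min": 41, "max": 90},   # 43 inside [41, 90]   -> read
    "part3.parquet": {"min": 91, "max": 120},  # 43 outside [91, 120] -> skip
}

def files_to_read(stats, value):
    """Return the files whose [min, max] range could contain `value`."""
    return [name for name, s in stats.items() if s["min"] <= value <= s["max"]]

# Only part2.parquet has to be read fully; the others are skipped.
print(files_to_read(footers, 43))
```

Note that the statistics only prove a value is *absent*; a file whose range contains the value must still be read in full, since the value may or may not actually be present.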