
Spark Predicate Push Down, Filtering and Partition Pruning for Azure Data Lake

I have been reading about Spark predicate pushdown and partition pruning to understand how much data gets read. I have the following doubts related to this.

Suppose I have a dataset with columns (Year: Int, SchoolName: String, StudentId: Int, SubjectEnrolled: String), where the data on disk is partitioned by Year and SchoolName and stored in Parquet format on, say, Azure Data Lake Storage.
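
For reference, a layout partitioned this way could be produced by a write along the following lines (a minimal Scala sketch; the storage account, container and sample rows are illustrative, not part of the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample rows with the schema described above
val students = Seq(
  (2019, "XYZ", 43, "Math"),
  (2018, "ABC", 44, "History")
).toDF("Year", "SchoolName", "StudentId", "SubjectEnrolled")

// partitionBy creates Year=.../SchoolName=.../ folders under the target path
students.write
  .partitionBy("Year", "SchoolName")
  .parquet("abfss://container@account.dfs.core.windows.net/parquet-folder")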

1) If I issue a read spark.read(container).filter(Year=2019, SchoolName="XYZ"):

  • Will partition pruning take effect, so that only a limited number of partitions are read?
  • Will there be I/O against the blob store, with the data loaded into the Spark cluster and then filtered, i.e. will I have to pay Azure for the I/O of all the other data that we don't need?
  • If not, how does the Azure blob file system understand these filters, since it is not queryable by default?

2) If I issue a read spark.read(container).filter(StudentId = 43):

  • Will Spark still push the filter down to disk and read only the data that is required? Since I didn't partition by this column, will it have to look at every row and filter according to the query?
  • Again, will I have to pay Azure for the I/O of all the files that were not required by the query?

1) When you filter on the columns you partitioned by, Spark will skip the non-matching files completely and it won't cost you any I/O. If you look at your file structure, it's stored as something like:

parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet
parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet
parquet-folder/Year=2019/SchoolName=XYZ/...
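
A quick way to confirm this is to look at the physical plan. In a sketch like the following (the path is hypothetical), the partition-column predicates should appear under PartitionFilters in the FileScan node, so only the matching folder is listed and read:

import org.apache.spark.sql.functions.col

val df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/parquet-folder")
val pruned = df.filter(col("Year") === 2019 && col("SchoolName") === "XYZ")
pruned.explain(true)
// The FileScan node should show Year and SchoolName under PartitionFilters,
// meaning only parquet-folder/Year=2019/SchoolName=XYZ/ is listed and read.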

2) When you filter on a column that isn't a partition column, Spark will scan every part file in every folder of that Parquet table. Only with pushdown filtering will Spark use the footer of every part file (where the min, max and count statistics are stored) to determine whether your search value falls within that range. If yes, Spark will read the file fully. If not, Spark will skip the whole file, so you at least avoid paying for the full read.
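
Continuing the sketch above, a filter on a non-partition column shows up as a pushed data filter rather than a partition filter: every Year=.../SchoolName=.../ folder is still listed, and the footer statistics decide which files (or row groups) actually need to be read:

val byStudent = df.filter(col("StudentId") === 43)
byStudent.explain(true)
// StudentId appears under PushedFilters/DataFilters, not PartitionFilters:
// all partition folders are listed, and the min/max statistics in each Parquet
// footer determine whether a given file or row group can be skipped.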
