
Spark read partitions - Resource cost analysis

When reading data partitioned by column in Spark, a call like spark.read.json("/A=1/B=2/C=3/D=4/E=5/") scans only the files in the folder E=5.

But let's say I am interested in reading the partitions in which C = my_value across the whole data source. The call would be spark.read.json("/*/*/C=my_value/").

What happens computationally under the hood in this scenario? Will Spark just list the partition values of A and B? Or will it scan through all the leaves (the actual files) too?

Thank you for an interesting question. Apache Spark uses Hadoop's FileSystem abstraction to deal with wildcard patterns. In the source code they're called glob patterns.

The org.apache.hadoop.fs.FileSystem#globStatus(org.apache.hadoop.fs.Path) method is used to return "an array of paths that match the path pattern". This method then calls org.apache.hadoop.fs.Globber#glob, which implements the actual algorithm for matching files against the glob pattern. globStatus is called by org.apache.spark.sql.execution.datasources.DataSource#checkAndGlobPathIfNecessary. You can add some breakpoints to see how it works under the hood.

But long story short:

To come back to your question of whether Spark only lists the partition values of A and B, or also scans through all the leaves (the actual files):

Your glob is split into 3 components, ["*", "*", "C=my_value"]. The algorithm then walks the directory tree level by level, listing the entries with Hadoop's org.apache.hadoop.fs.FileSystem#listStatus(org.apache.hadoop.fs.Path) method. For every entry it builds a path and tries to match it against the current component. The matching entries are kept as "candidates", and the candidate set is narrowed down to the final result only at the last step, when the algorithm looks for "C=my_value".
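
The level-by-level matching described above can be illustrated with a small, self-contained Python sketch. This is a simplified model and not Hadoop's Globber code: the names expand_glob and build_demo_tree are hypothetical, Python's fnmatch stands in for Hadoop's glob filter, and the real implementation may take shortcuts (for example for components without wildcards). The sketch also records which directories had to be listed, to show where the listing cost comes from.

```python
import fnmatch
import os
import tempfile


def expand_glob(root, pattern):
    """Expand a slash-separated glob one path component at a time.

    Returns the matching paths and the list of directories that were listed.
    Hypothetical helper: a simplified model, not Hadoop's implementation.
    """
    components = [c for c in pattern.split("/") if c]
    candidates = [root]
    listed = []  # every directory we had to list, in order
    for component in components:
        next_candidates = []
        for candidate in candidates:
            if not os.path.isdir(candidate):
                continue  # plain files have no children to match
            listed.append(candidate)
            for name in sorted(os.listdir(candidate)):
                # Keep only the children whose name matches this component.
                if fnmatch.fnmatchcase(name, component):
                    next_candidates.append(os.path.join(candidate, name))
        candidates = next_candidates
    return candidates, listed


def build_demo_tree(root):
    """Create root/A=*/B=*/C=*/part-0.json with two values per column."""
    for a in ("1", "2"):
        for b in ("1", "2"):
            for c in ("x", "y"):
                leaf_dir = os.path.join(root, f"A={a}", f"B={b}", f"C={c}")
                os.makedirs(leaf_dir)
                with open(os.path.join(leaf_dir, "part-0.json"), "w") as fh:
                    fh.write("{}\n")


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(tmp)
    matches, listed = expand_glob(tmp, "/*/*/C=x/")
    relative_matches = [
        os.path.relpath(m, tmp).replace(os.sep, "/") for m in matches
    ]
    print(relative_matches)  # the four C=x directories
    print(len(listed))       # number of directory listings performed
```

On this demo tree the sketch lists 7 directories (the root, the 2 A directories and the 4 B directories) and never lists the contents of a C=... directory, so in this model the glob expansion itself does not touch the leaf files. Spark still has to list the files under the matched directories afterwards in order to read them.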

Unless you have a lot of files, this operation shouldn't hurt you. That is probably one of the reasons why you should prefer fewer but bigger files (the well-known data engineering problem of "too many small files").


 