Spark read partitions - Resource cost analysis
When reading data partitioned by column in Spark, something like spark.read.json("/A=1/B=2/C=3/D=4/E=5/") will scan only the files in the folder E=5.
But let's say I am interested in reading the partitions in which C = my_value, across the whole data source. The instruction would be spark.read.json("/*/*/C=my_value/").
What happens computationally under the hood in the described scenario? Will Spark just list through the partition values of A and B, or will it also scan through all the leaves (the actual files)?
Thank you for an interesting question. Apache Spark uses Hadoop's FileSystem abstraction to deal with wildcard patterns. In the source code they're called glob patterns. The org.apache.hadoop.fs.FileSystem#globStatus(org.apache.hadoop.fs.Path) method is used to return "an array of paths that match the path pattern". This function then calls org.apache.hadoop.fs.Globber#glob, which implements the file-matching algorithm for the glob pattern. On the Spark side, globStatus is called by org.apache.spark.sql.execution.datasources.DataSource#checkAndGlobPathIfNecessary. You can add some breakpoints there to see how it works under the hood.
But long story short:
What happens computationally under the hood in the described scenario? Will Spark just list through the partition values of A and B, or will it also scan through all the leaves (the actual files)?
Spark will split your glob into 3 parts: ["*", "*", "C=my_value"]. It will then list the files at every level by using Hadoop's org.apache.hadoop.fs.FileSystem#listStatus(org.apache.hadoop.fs.Path) method. For every file it will build a path and try to match it against the current pattern component. The matching files are kept as "candidates" and are filtered only at the last step, when the algorithm looks for "C=my_value".
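The level-by-level matching described above can be sketched as follows. This is a simplified pure-Python re-implementation of the idea, not the actual Hadoop Globber code; the in-memory "file system" and its partition values are made-up examples.

```python
import fnmatch

# Toy file system: directory path -> names of its children.
fs = {
    "": ["A=1", "A=2"],
    "A=1": ["B=1", "B=2"],
    "A=2": ["B=1"],
    "A=1/B=1": ["C=my_value", "C=other"],
    "A=1/B=2": ["C=other"],
    "A=2/B=1": ["C=my_value"],
}

def glob(pattern):
    # Split the glob into components, e.g. ["*", "*", "C=my_value"].
    components = pattern.strip("/").split("/")
    candidates = [""]  # start at the root
    for component in components:
        next_candidates = []
        for path in candidates:
            # The "listStatus" step: list the children of each candidate,
            # keep the ones matching the current pattern component.
            for child in fs.get(path, []):
                if fnmatch.fnmatch(child, component):
                    next_candidates.append(f"{path}/{child}".strip("/"))
        candidates = next_candidates
    return candidates

print(glob("/*/*/C=my_value/"))
# ['A=1/B=1/C=my_value', 'A=2/B=1/C=my_value']
```

Note that every A and B directory gets listed along the way; the C=my_value filter only prunes the candidate set at the last level.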
Unless you have a lot of files, this operation shouldn't hurt you. And that's probably one of the reasons why you should keep fewer but bigger files (the famous "too many small files" data engineering problem).
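To get a feel for how listing cost grows, here is a back-of-the-envelope count of the directory listings the leading "*" components trigger. The fan-out numbers are hypothetical, not from the question:

```python
# With a "/*/*/C=my_value/" pattern, every A and every (A, B) directory
# must be listed before the C=my_value filter can prune anything.
num_a = 100  # hypothetical number of A=... partitions
num_b = 100  # hypothetical number of B=... partitions under each A

# 1 listing of the root, plus one per A directory,
# plus one per (A, B) directory pair.
listings = 1 + num_a + num_a * num_b
print(listings)  # 10101 directory listings before any data is read
```

The listing cost is driven by the partition fan-out, not by the data volume itself, which is why many tiny partitions hurt more than a few large ones.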