Spark read partitions - Resource cost analysis
When reading data partitioned by column in Spark, something like spark.read.json("/A=1/B=2/C=3/D=4/E=5/") will scan only the files in the folder E=5.
But let's say I am interested in reading the partitions in which C = my_value, across the whole data source. The instruction would be spark.read.json("/*/*/C=my_value/").
What happens computationally under the hood in the described scenario? Will Spark just list through the partition values of A and B, or will it also scan through all the leaves (the actual files)?
Thank you for an interesting question. Apache Spark uses Hadoop's FileSystem abstraction to deal with wildcard patterns. In the source code they're called glob patterns. The org.apache.hadoop.fs.FileSystem#globStatus(org.apache.hadoop.fs.Path) method is used to return "an array of paths that match the path pattern". This function then calls org.apache.hadoop.fs.Globber#glob, which implements the file-matching algorithm for the glob pattern. On the Spark side, globStatus is called by org.apache.spark.sql.execution.datasources.DataSource#checkAndGlobPathIfNecessary. You can add some breakpoints there to see how it works under the hood.
But long story short:
What happens computationally under the hood in the described scenario? Will Spark just list through the partition values of A and B, or will it also scan through all the leaves (the actual files)?
Spark will split your glob into 3 parts: ["*", "*", "C=my_value"]. It will then list the files at every level by using Hadoop's org.apache.hadoop.fs.FileSystem#listStatus(org.apache.hadoop.fs.Path) method. For every file it will build a path and try to match it against the current pattern component. The matching files are kept as "candidates" and are filtered only at the last step, when the algorithm looks for "C=my_value".
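The level-by-level matching described above can be sketched as follows. This is a simplified pure-Python re-implementation of the idea, not the actual Hadoop Globber code; the in-memory "file system" and its partition values are made-up examples.

```python
import fnmatch

# Toy file system: directory path -> names of its children.
fs = {
    "": ["A=1", "A=2"],
    "A=1": ["B=1", "B=2"],
    "A=2": ["B=1"],
    "A=1/B=1": ["C=my_value", "C=other"],
    "A=1/B=2": ["C=other"],
    "A=2/B=1": ["C=my_value"],
}

def glob(pattern):
    # Split the glob into components, e.g. ["*", "*", "C=my_value"].
    components = pattern.strip("/").split("/")
    candidates = [""]  # start at the root
    for component in components:
        next_candidates = []
        for path in candidates:
            # The "listStatus" step: list the children of each candidate,
            # keep the ones matching the current pattern component.
            for child in fs.get(path, []):
                if fnmatch.fnmatch(child, component):
                    next_candidates.append(f"{path}/{child}".strip("/"))
        candidates = next_candidates
    return candidates

print(glob("/*/*/C=my_value/"))
# ['A=1/B=1/C=my_value', 'A=2/B=1/C=my_value']
```

Note that every A and B directory gets listed along the way; the C=my_value filter only prunes the candidate set at the last level.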
Unless you have a lot of files, this operation shouldn't hurt you. And that's probably one of the reasons why you should keep fewer but bigger files (the famous "too many small files" data engineering problem).
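To get a feel for how listing cost grows, here is a back-of-the-envelope count of the directory listings the leading "*" components trigger. The fan-out numbers are hypothetical, not from the question:

```python
# With a "/*/*/C=my_value/" pattern, every A and every (A, B) directory
# must be listed before the C=my_value filter can prune anything.
num_a = 100  # hypothetical number of A=... partitions
num_b = 100  # hypothetical number of B=... partitions under each A

# 1 listing of the root, plus one per A directory,
# plus one per (A, B) directory pair.
listings = 1 + num_a + num_a * num_b
print(listings)  # 10101 directory listings before any data is read
```

The listing cost is driven by the partition fan-out, not by the data volume itself, which is why many tiny partitions hurt more than a few large ones.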