
Spark read partitions - Resource cost analysis

Reading data partitioned by column in Spark with something like spark.read.json("/A=1/B=2/C=3/D=4/E=5/") will scan only the files in the folder E=5.

But let's say I am interested in reading the partitions in which C = my_value across the whole data source. The instruction will be spark.read.json("/*/*/C=my_value/").
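For concreteness, here is a minimal sketch of the two read patterns in Scala, assuming an existing SparkSession named spark and the placeholder paths used above:

```scala
import org.apache.spark.sql.SparkSession

// Assumes Spark is available; the paths are the placeholders from the question.
val spark = SparkSession.builder().appName("partition-read-sketch").getOrCreate()

// Fully specified partition path: only the files under .../E=5/ are read.
val oneLeaf = spark.read.json("/A=1/B=2/C=3/D=4/E=5/")

// Glob over the first two partition levels: every A=* and B=* directory has to
// be listed before the C=my_value branches can be selected.
val allByC = spark.read.json("/*/*/C=my_value/")
```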

What happens computationally in the described scenario under the hood? Will Spark just list through the partition values of A and B, or will it scan through all the leaves (the actual files) too?

Thank you for an interesting question. Apache Spark uses Hadoop's FileSystem abstraction to deal with wildcard patterns. In the source code they're called glob patterns.

The org.apache.hadoop.fs.FileSystem#globStatus(org.apache.hadoop.fs.Path) method is used to return "an array of paths that match the path pattern". This function then calls org.apache.hadoop.fs.Globber#glob to figure out the exact file-matching algorithm for the glob pattern. globStatus is called by org.apache.spark.sql.execution.datasources.DataSource#checkAndGlobPathIfNecessary. You can add some breakpoints to see how it works under the hood.
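To see what globStatus returns for this kind of pattern, here is a small standalone sketch against the Hadoop FileSystem API; the Configuration setup and the pattern string are illustrative assumptions, not part of the original answer:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()   // picks up core-site.xml / default file system
val fs = FileSystem.get(conf)

// globStatus returns a FileStatus for every path matching the glob pattern;
// it can return null when a non-glob base path does not exist.
val matches = fs.globStatus(new Path("/*/*/C=my_value"))
if (matches != null) {
  matches.foreach(status => println(status.getPath))
}
```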

But long story short:

What happens computationally in the described scenario under the hood? Will Spark just list through the partition values of A and B, or will it scan through all the leaves (the actual files) too?

Spark will split your glob into 3 parts: ["*", "*", "C=my_value"]. Later, it will list files at every level by using Hadoop's org.apache.hadoop.fs.FileSystem#listStatus(org.apache.hadoop.fs.Path) method. For every file it will build a path and try to match it against the current pattern. The matching files are kept as "candidates" that are filtered out only at the last step, when the algorithm looks for "C=my_value".
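The following is a deliberately simplified illustration of that level-by-level expansion, not the actual org.apache.hadoop.fs.Globber code (which also handles escaping, non-glob components, missing directories, and more); it only distinguishes a literal "*" from an exact directory name:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Expand one glob component per level, starting from the given root paths.
def expand(fs: FileSystem, roots: Seq[Path], components: List[String]): Seq[Path] =
  components match {
    case Nil => roots
    case component :: rest =>
      // listStatus enumerates the children of every current candidate; a child
      // survives if the component is "*" or equals its directory name exactly.
      val survivors = roots.flatMap { root =>
        fs.listStatus(root)
          .filter(st => component == "*" || st.getPath.getName == component)
          .map(_.getPath)
      }
      expand(fs, survivors, rest)
  }

// For the question's pattern:
// expand(fs, Seq(new Path("/")), List("*", "*", "C=my_value"))
```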

Unless you have a lot of files, this operation shouldn't hurt you. And that is probably one of the reasons why you should keep fewer but bigger files (the famous data engineering problem of "too many small files").
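As a side note on the "too many small files" point, one common way to end up with fewer but bigger files is to repartition by the partition columns before writing, so each output directory receives roughly one file per partition value. This is an illustrative pattern of my own, not something prescribed by the answer; the DataFrame df and the output path are assumptions:

```scala
import org.apache.spark.sql.functions.col

// Assuming a DataFrame `df` that contains the columns A, B, C, D and E.
df.repartition(col("A"), col("B"), col("C"), col("D"), col("E"))
  .write
  .partitionBy("A", "B", "C", "D", "E")
  .mode("overwrite")
  .json("/some/output/path")
```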
