
What happens when the shuffle partition count is greater than 200 (spark.sql.shuffle.partitions is 200 by default for DataFrames)?

A Spark SQL aggregation operation shuffles data, and the number of shuffle partitions is controlled by spark.sql.shuffle.partitions (200 by default). What happens to performance when the shuffle partition count is greater than 200?
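For context, here is a minimal sketch (in Scala) of the kind of aggregation being asked about and how to inspect the setting; the input path and the "region"/"amount" columns are hypothetical, only the config key is Spark's own:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .appName("shuffle-partitions-demo")
  .getOrCreate()

// Number of partitions used after a shuffle; 200 unless overridden.
println(spark.conf.get("spark.sql.shuffle.partitions"))

// Hypothetical input and columns, only for illustration.
val sales = spark.read.parquet("/path/to/sales")

// groupBy triggers a shuffle, so the aggregated result ends up with
// spark.sql.shuffle.partitions output partitions.
val byRegion = sales.groupBy("region").agg(sum("amount"))
println(byRegion.rdd.getNumPartitions)
```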

Spark uses a different data structure for shuffle book-keeping when the number of partitions is greater than 2000, so if the number of partitions is close to 2000, it is worth increasing it past 2000.

But my question is: what will the behavior be when the shuffle partition count is greater than 200 (let's say 300)?

The number 200 was selected as the default based on typical workloads on relatively big clusters with enough resources allocated to jobs. Otherwise, this number should be chosen based on two factors: the number of available cores and the partition size (it's recommended to keep partitions close to 100 MB). The selected number of partitions should be a multiple of the number of available cores, but shouldn't be very big (typically 1-3x the number of cores). A number of partitions greater than the default doesn't change Spark's behavior; it just increases the number of tasks that Spark will need to execute.
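As a rough, hedged illustration of that sizing heuristic (a sketch, not a definitive rule), the setting can be derived from an estimate of the shuffled data size and the available cores; the ~60 GB figure and the 2x-cores floor below are assumptions for the example:

```scala
// Sketch of the heuristic above: aim for ~100 MB partitions,
// rounded up to a multiple of the available cores, with a floor of 2x cores.
val cores        = spark.sparkContext.defaultParallelism   // total available cores
val targetMb     = 100L                                    // ~100 MB per partition
val shuffleBytes = 60L * 1024 * 1024 * 1024                // assumed ~60 GB of shuffled data

val bySize         = math.max(1L, shuffleBytes / (targetMb * 1024 * 1024)).toInt
val roundedToCores = ((bySize + cores - 1) / cores) * cores   // multiple of cores
val numPartitions  = math.max(roundedToCores, cores * 2)      // at least 2x cores

spark.conf.set("spark.sql.shuffle.partitions", numPartitions.toString)
```

The 2x-cores floor simply keeps every core busy with more than one wave of tasks; the exact multiplier (1-3x) is a tuning choice for the workload at hand.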

You can watch this talk from Spark + AI Summit 2019 - it covers a lot of details on optimizing Spark programs, including selection of the number of partitions.

