Does the "spark.sql.shuffle.partitions" configuration affect non-SQL shuffling?
We don't have a lot of SQL in our Spark jobs (that is a problem, I know, but for now it's a fact). I want to tune the size and number of our shuffle partitions to optimize our Spark usage. I have seen many sources suggest that setting spark.sql.shuffle.partitions is a good option. But will it have any effect if we almost never use Spark SQL?
Indeed, spark.sql.shuffle.partitions has no effect on jobs defined through the RDD API. The configuration you are looking for is spark.default.parallelism. According to the documentation:

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
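As a minimal sketch, both settings can be passed at submit time; which one matters depends on the API your job uses (the job script name here is hypothetical):

```shell
# spark.default.parallelism  -> partition count for RDD shuffles
#                               (reduceByKey, join, parallelize, ...)
# spark.sql.shuffle.partitions -> partition count for DataFrame/Dataset
#                                 (Spark SQL) shuffles only
spark-submit \
  --conf spark.default.parallelism=200 \
  --conf spark.sql.shuffle.partitions=200 \
  my_job.py   # hypothetical job script
```

If your jobs are almost entirely RDD-based, only spark.default.parallelism will change the number of shuffle partitions; the SQL setting will simply be ignored.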