Difference in Spark SQL Shuffle partitions
I am trying to understand Spark SQL shuffle partitions, which default to 200. The data looks like this, followed by the number of partitions created in each of the two cases.
scala> flightData2015.show(3)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
| United States| Romania| 15|
| United States| Croatia| 1|
| United States| Ireland| 344|
+-----------------+-------------------+-----+
scala> println(flightData2015.sort("DEST_COUNTRY_NAME").rdd.getNumPartitions)
104
scala> println(flightData2015.groupBy("DEST_COUNTRY_NAME").count().rdd.getNumPartitions)
200
Both cases cause a shuffle stage, which should result in 200 partitions (the default value). Can someone explain why there is a difference?
According to the Spark documentation there are two ways of repartitioning the data. One is via this configuration:

spark.sql.shuffle.partitions

which defaults to 200 and is always applied when you run any join or aggregation.
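For example, lowering this setting changes the number of partitions any aggregation produces (a sketch, assuming the same flightData2015 DataFrame from the question is loaded, and that adaptive query execution is disabled — AQE may otherwise coalesce the shuffle output):

scala> spark.conf.set("spark.sql.shuffle.partitions", "50")

scala> println(flightData2015.groupBy("DEST_COUNTRY_NAME").count().rdd.getNumPartitions)
50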
When we are talking about sort(), it is not that simple: Spark uses a planner to identify how skewed the data is across the dataset. If it is not too skewed, then instead of using a sort-merge join, which would result in 200 partitions as you expected, it prefers to broadcast the data across the partitions, avoiding a full shuffle. This can save time during the sorting by reducing the amount of network traffic.
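You can see which partitioning the planner chose for each query by inspecting the physical plan (a sketch; the exact wording of the Exchange lines varies by Spark version):

scala> flightData2015.sort("DEST_COUNTRY_NAME").explain()
// look for: Exchange rangepartitioning(DEST_COUNTRY_NAME ASC NULLS FIRST, 200)

scala> flightData2015.groupBy("DEST_COUNTRY_NAME").count().explain()
// look for: Exchange hashpartitioning(DEST_COUNTRY_NAME, 200)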
The difference between these two situations is that sort and groupBy use different partitioners under the hood.

groupBy - uses hashPartitioning, which means it computes the hash of the key and then computes pmod by 200 (or whatever the number of shuffle partitions is set to), so it will always create 200 partitions (even though some of them may be empty).

sort / orderBy - uses rangePartitioning, which means it runs a separate job to sample the data and, based on that sample, creates the boundaries for the partitions (trying to make 200 of them). Based on the sampled data distribution and the actual row count, it may create boundaries for fewer than 200 partitions, which is why you got only 104.