I am trying to understand the Spark SQL shuffle partitions setting (spark.sql.shuffle.partitions), which defaults to 200. The data looks like this, followed by the number of partitions created in two cases.
scala> flightData2015.show(3)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
| United States| Romania| 15|
| United States| Croatia| 1|
| United States| Ireland| 344|
+-----------------+-------------------+-----+
scala> println(flightData2015.sort("DEST_COUNTRY_NAME").rdd.getNumPartitions)
104
scala> println(flightData2015.groupBy("DEST_COUNTRY_NAME").count().rdd.getNumPartitions)
200
Both operations cause a shuffle stage, which I expected to result in 200 partitions (the default value). Can someone explain why there is a difference?
According to the Spark documentation, there are two ways the data can end up repartitioned. One is via the configuration spark.sql.shuffle.partitions, which defaults to 200 and is applied whenever you run a join or an aggregation.
When it comes to sort(), things are not that simple. Spark's planner first inspects how the values are distributed across the dataset: instead of blindly creating 200 partitions, it samples the sort column and derives partition boundaries from that sample. This can produce fewer partitions than the configured default and reduces the amount of data moved over the network during the sort.
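The aggregation side of this can be sketched in a few lines. The following is a conceptual model only, not Spark's actual implementation (Spark hashes keys with Murmur3 on the JVM); it shows why a hash-partitioned shuffle always yields the configured number of partitions, however few distinct keys there are:

```python
# Conceptual model (not Spark's code; Spark uses Murmur3 hashing) of
# hash shuffle partitioning: each grouping key lands in partition
# hash(key) pmod numPartitions.
NUM_SHUFFLE_PARTITIONS = 200  # mirrors the spark.sql.shuffle.partitions default

def hash_partition(key, num_partitions=NUM_SHUFFLE_PARTITIONS):
    # Python's built-in hash() stands in for Spark's Murmur3 hash.
    return hash(key) % num_partitions

keys = ["United States", "Romania", "Croatia", "Ireland"]
partitions = {k: hash_partition(k) for k in keys}

# Every key maps into [0, 200); the shuffle always has 200 output
# partitions no matter how few distinct keys exist -- most stay empty.
assert all(0 <= p < NUM_SHUFFLE_PARTITIONS for p in partitions.values())
```

With only four distinct grouping keys, at most four of the 200 partitions carry data; the rest are empty but still exist, which is why groupBy reports 200.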
The difference between these two situations is that sort and groupBy use different partitioners under the hood.
groupBy - uses hashPartitioning, which computes a hash of the key and then takes it pmod 200 (or whatever is set as the number of shuffle partitions), so it will always create 200 partitions (even though some of them may be empty).
sort / orderBy - uses rangePartitioning, which runs a separate job to sample the data and, based on that sample, creates the boundaries for the partitions (trying to make 200 of them). Depending on the sampled data distribution and the actual row count, it may create boundaries for fewer than 200 partitions, which is why you got only 104.
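The range-partitioning behavior can be illustrated with a small sketch. This is a simplified model, not Spark's RangePartitioner (which uses reservoir sampling and weighted candidates); the names and the boundary-picking logic here are illustrative. It shows how duplicate sample values collapse, leaving fewer boundaries than requested:

```python
# Conceptual model (not Spark's RangePartitioner) of how sort/orderBy
# derives partition boundaries from a sample of the sort key.
import random

def range_boundaries(sample, requested_partitions):
    """Pick up to requested_partitions - 1 distinct boundary values
    from a sample of the sort key."""
    distinct = sorted(set(sample))
    if len(distinct) <= 1:
        return []
    step = max(1, len(distinct) // requested_partitions)
    # Duplicate sample values collapse into one candidate, so we can
    # end up with fewer boundaries than requested.
    return distinct[step::step][: requested_partitions - 1]

# A sample with only a handful of distinct destination countries:
random.seed(0)
sample = random.choices(["United States", "Canada", "Mexico", "Egypt"], k=1000)

boundaries = range_boundaries(sample, 200)
num_partitions = len(boundaries) + 1
# With so few distinct keys, the planner cannot build 200 ranges --
# analogous to getting 104 instead of 200 partitions in the question.
assert num_partitions < 200
```

Because the boundaries come from sampled data, the resulting partition count depends on the actual key distribution, which is why sort can report a number like 104 rather than the configured 200.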