
Difference in Spark SQL Shuffle partitions

I am trying to understand Spark SQL shuffle partitions, which default to 200. The data looks like this, followed by the number of partitions created in the two cases.

scala> flightData2015.show(3)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
+-----------------+-------------------+-----+

scala> println(flightData2015.sort("DEST_COUNTRY_NAME").rdd.getNumPartitions)
104

scala> println(flightData2015.groupBy("DEST_COUNTRY_NAME").count().rdd.getNumPartitions)
200

Both cases trigger a shuffle stage, which should result in 200 partitions (the default value). Can someone explain why there is a difference?

According to the Spark documentation, there are two ways to repartition the data. One is via the configuration spark.sql.shuffle.partitions, which defaults to 200 and is always applied when you run a join or aggregation, as you can see here.
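A minimal sketch of that configuration in action, run against the same flightData2015 DataFrame from the question (the value 50 is just an arbitrary example):

// Lowering spark.sql.shuffle.partitions changes how many partitions the
// aggregation's shuffle produces; 50 is an arbitrary example value.
spark.conf.set("spark.sql.shuffle.partitions", "50")

val counts = flightData2015.groupBy("DEST_COUNTRY_NAME").count()
println(counts.rdd.getNumPartitions)   // should now print 50 instead of the default 200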

When we are talking about sort(), it is not that simple. Spark uses a planner to identify how skewed the data is across the dataset. If it is not too skewed, then instead of using a sort-merge join that would result in the 200 partitions you expected, it prefers to broadcast the data across the partitions and avoid a full shuffle. This can save time during the sorting by reducing the amount of network traffic; more details here.
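For reference, this is what a broadcast looks like in Spark SQL: the broadcast() hint asks the planner to ship one side of a join to every executor instead of shuffling it. A minimal sketch, using a hypothetical small countries lookup table:

import org.apache.spark.sql.functions.broadcast
import spark.implicits._   // spark is the SparkSession provided by spark-shell

// Hypothetical small lookup table; broadcast() hints that it should be copied to
// every executor so the join avoids shuffling flightData2015 across the network.
val countries = Seq(("United States", "US"), ("Romania", "RO")).toDF("name", "code")

val joined = flightData2015.join(broadcast(countries),
  flightData2015("DEST_COUNTRY_NAME") === countries("name"))

joined.explain()   // the plan should show BroadcastHashJoin rather than SortMergeJoin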

The difference between these two situations is that sort and groupBy use different partitioners under the hood.

  1. groupBy - uses hash partitioning, which means it computes a hash of the key and then takes it modulo 200 (or whatever the number of shuffle partitions is set to), so it will always create 200 partitions (even though some of them may be empty).
  2. sort / orderBy - uses range partitioning, which means it runs a separate job to sample the data and, based on that sample, creates the partition boundaries (aiming for 200 of them). Depending on the sampled distribution and the actual row count, it may end up with boundaries for fewer than 200 partitions, which is why you got only 104. Both partitioners are visible in the physical plans, as the explain() sketch after this list shows.
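A minimal sketch of how to see both partitioners, run against the same flightData2015 DataFrame (the expression IDs in your plans will differ):

// groupBy: the Exchange node in the plan shows hashpartitioning(DEST_COUNTRY_NAME, 200)
flightData2015.groupBy("DEST_COUNTRY_NAME").count().explain()

// sort: the Exchange node shows rangepartitioning(DEST_COUNTRY_NAME ASC NULLS FIRST, 200);
// here 200 is only an upper bound, since the sampling job decides the actual boundary count
flightData2015.sort("DEST_COUNTRY_NAME").explain()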
