
Difference in Spark SQL Shuffle partitions

I am trying to understand Spark SQL shuffle partitions, which are set to 200 by default. The data looks like this, followed by the number of partitions created in two cases.

scala> flightData2015.show(3)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
+-----------------+-------------------+-----+

scala> println(flightData2015.sort("DEST_COUNTRY_NAME").rdd.getNumPartitions)
104

scala> println(flightData2015.groupBy("DEST_COUNTRY_NAME").count().rdd.getNumPartitions)
200

Both cases cause a shuffle stage, which should result in 200 partitions (the default value). Can someone explain why there is a difference?

According to the Spark documentation there are two ways of repartitioning the data. One is via the configuration spark.sql.shuffle.partitions, which defaults to 200 and is always applied when you run a join or aggregation, as you can see here.
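For example, changing that setting changes how many post-shuffle partitions the aggregation produces (a minimal spark-shell sketch, reusing the flightData2015 DataFrame from the question; the value 50 is arbitrary):

scala> spark.conf.set("spark.sql.shuffle.partitions", "50")

scala> println(flightData2015.groupBy("DEST_COUNTRY_NAME").count().rdd.getNumPartitions)
50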

When it comes to sort(), things are not that simple: Spark uses a planner to identify how skewed the data is across the dataset. If it is not too skewed, instead of using a sort-merge join, which would result in 200 partitions as you expected, it prefers to broadcast the data across the partitions and avoid a full shuffle. This can save time during sorting and reduce the amount of network traffic; more details here.
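One way to see what the planner decided is to inspect the physical plan (a sketch; the exact wording of the Exchange nodes varies by Spark version):

scala> flightData2015.sort("DEST_COUNTRY_NAME").explain()
// look for an "Exchange rangepartitioning(DEST_COUNTRY_NAME ASC NULLS FIRST, 200)" node in the output

scala> flightData2015.groupBy("DEST_COUNTRY_NAME").count().explain()
// look for an "Exchange hashpartitioning(DEST_COUNTRY_NAME, 200)" node in the output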

The difference between these two situations is that sort and groupBy use different partitioners under the hood.

  1. groupBy - uses hashPartitioning, which means it computes a hash of the key and then takes pmod by 200 (or whatever is set as the number of shuffle partitions), so it always creates 200 partitions (even though some of them may be empty).
  2. sort / orderBy - uses rangePartitioning, which means it runs a separate job to sample the data and, based on that, creates the boundaries for the partitions (trying to make 200 of them). Depending on the sampled data distribution and the actual row count, it may create fewer than 200 boundaries, which is why you got only 104. A rough sketch of both behaviours follows after this list.
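As a sketch of both points, assuming the same flightData2015 DataFrame: the expression pmod(hash(key), 200) mirrors what hashPartitioning computes per row, and counting rows per output partition after the sort shows how many range boundaries the sampling actually produced.

scala> import org.apache.spark.sql.functions.expr

scala> flightData2015.select(expr("DEST_COUNTRY_NAME"), expr("pmod(hash(DEST_COUNTRY_NAME), 200)").alias("shuffle_partition")).show(3)
// each key maps to one of the 200 hash partitions; with few distinct keys, most of the 200 partitions stay empty

scala> flightData2015.sort("DEST_COUNTRY_NAME").rdd.mapPartitions(it => Iterator(it.size)).collect()
// one element per range partition (104 in the question's run), showing that fewer than 200 boundaries were created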
