
spark.sql.shuffle.partitions of 200 default partitions conundrum

In many posts there is the statement below, in some form or another, in answer to questions about shuffling and partitioning due to JOIN, AGGR, etc.:

... In general whenever you do a spark sql aggregation or join which shuffles data this is the number of resulting partitions = 200. This is set by spark.sql.shuffle.partitions. ...

So, my question is:

  • Do we mean that if we have set partitioning at 765 for a DF, for example,
    • That the processing occurs against 765 partitions, but that the output is coalesced / re-partitioned standardly to 200 - referring here to the word resulting?
    • Or does it do the processing using 200 partitions, after coalescing / re-partitioning to 200 partitions before the JOIN or AGGR?

I ask because I have never seen a clear answer on this.

I did the following test:

// df0: a generated DS of some 20M short rows
df0.count
val ds1 = df0.repartition(765)
ds1.count
val ds2 = df0.repartition(765)
ds2.count

// The line below was NOT set on the 1st run, and WAS set on the 2nd run.
sqlContext.setConf("spark.sql.shuffle.partitions", "765")

ds1.rdd.partitions.size   // 765 - fixed by the explicit repartition above
ds2.rdd.partitions.size   // 765

val joined = ds1.join(ds2, ds1("time_asc") === ds2("time_asc"), "outer")
joined.rdd.partitions.size   // partition count of the join result
joined.count
joined.rdd.partitions.size

On the 1st test - without defining sqlContext.setConf("spark.sql.shuffle.partitions", "765") - the processing and resulting number of partitions was 200. Even though SO post 45704156 states it may not apply to DFs - this is a DS.

On the 2nd test - defining sqlContext.setConf("spark.sql.shuffle.partitions", "765") - the processing and resulting number of partitions was 765. Even though SO post 45704156 states it may not apply to DFs - this is a DS.
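For reference, the same effect can be obtained with the Spark 2.x+ SparkSession API - a minimal sketch, assuming a session named spark and the ds1 / ds2 from the test above; whatever value is in force when the join executes decides how many partitions its shuffle produces:

spark.conf.set("spark.sql.shuffle.partitions", "765")   // equivalent of the sqlContext.setConf call above
val joined2 = ds1.join(ds2, ds1("time_asc") === ds2("time_asc"), "outer")
joined2.rdd.partitions.size   // expected to be 765, the conf value used by the join's shuffle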

It is a combination of both your guesses.

Assume you have a set of input data with M partitions and you set shuffle partitions to N.

When executing a join, Spark reads your input data in all M partitions and re-shuffles the data, based on the key, into N partitions. Imagine a trivial hash partitioner: the hash function applied to the key pretty much looks like A = hashcode(key) % N, and the data is then re-allocated to the node in charge of handling the A-th partition. Each node can be in charge of handling multiple partitions.
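A minimal sketch of that idea in Scala - not Spark's exact internals, just the non-negative modulo a hash partitioner boils down to:

// Map a key to one of numPartitions shuffle partitions, keeping the result in [0, numPartitions).
def targetPartition(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  rawMod + (if (rawMod < 0) numPartitions else 0)
}

targetPartition("2019-01-01 10:00:00", 200)   // which of the 200 default shuffle partitions this key would land in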

After shuffling, the nodes work to aggregate the data in the partitions they are in charge of. As no additional shuffling needs to be done here, the nodes can produce the output directly.

So in summary, your output will be coalesced to N partitions; however, it is coalesced because it is processed in N partitions, not because Spark applies one additional shuffle stage to specifically repartition your output data to N.
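One way to see this on the test above is the physical plan - a minimal sketch, with the caveat that the exact plan text varies by Spark version:

joined.explain()
// The plan shows an Exchange hashpartitioning(time_asc, N) under each side of the join,
// where N is spark.sql.shuffle.partitions (200 or 765 above), and no extra repartition of the output.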

spark.sql.shuffle.partitions is the parameter which decides the number of partitions when doing shuffles such as joins or aggregations, i.e. where data movement happens across the nodes. The other part, spark.default.parallelism, will be calculated on the basis of your data size and max block size - in HDFS it's 128 MB. So if your job does not do any shuffle it will consider the default parallelism value, or if you are using an RDD you can set it on your own. When shuffling happens it will take 200.

val df = sc.parallelize(List(1,2,3,4,5), 4).toDF()
df.count()   // this will use 4 partitions

val df1 = df
df1.except(df).count   // will generate 200 partitions, having 2 stages
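To check which value applies where, a minimal sketch assuming a spark-shell session with spark and sc in scope:

sc.defaultParallelism                            // spark.default.parallelism, used by plain RDD shuffles and sc.parallelize
spark.conf.get("spark.sql.shuffle.partitions")   // used by DataFrame/Dataset shuffles such as join, except, groupBy; "200" by default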
