
spark.sql.shuffle.partitions of 200 default partitions conundrum

In many posts there is the statement below, in some form or another, in answer to questions about shuffling and partitioning due to JOIN, AGGR, etc.:

... In general whenever you do a spark sql aggregation or join which shuffles data this is the number of resulting partitions = 200. This is set by spark.sql.shuffle.partitions. ...

So, my question is:

  • Do we mean that if we have set partitioning at 765 for a DF, for example,
    • That the processing occurs against 765 partitions, but that the output is coalesced / re-partitioned standardly to 200 - referring here to the word resulting?
    • Or does it do the processing using 200 partitions, after coalescing / re-partitioning to 200 partitions before the JOIN or AGGR?

I ask because I have never seen a clear answer on this.

I did the following test:

// df0: a generated DS of some 20M short rows
df0.count
val ds1 = df0.repartition(765)
ds1.count
val ds2 = df0.repartition(765)
ds2.count

// The line below was NOT set on the 1st run, and WAS set on the 2nd run.
sqlContext.setConf("spark.sql.shuffle.partitions", "765")

ds1.rdd.partitions.size   // 765 - fixed by the explicit repartition above
ds2.rdd.partitions.size   // 765

val joined = ds1.join(ds2, ds1("time_asc") === ds2("time_asc"), "outer")
joined.rdd.partitions.size   // partition count of the join result
joined.count
joined.rdd.partitions.size

On the 1st test - without defining sqlContext.setConf("spark.sql.shuffle.partitions", "765") - the processing and resulting number of partitions was 200. Even though SO post 45704156 states it may not apply to DFs - this is a DS.

On the 2nd test - defining sqlContext.setConf("spark.sql.shuffle.partitions", "765") - the processing and resulting number of partitions was 765. Even though SO post 45704156 states it may not apply to DFs - this is a DS.
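For reference, the same effect can be obtained with the Spark 2.x+ SparkSession API - a minimal sketch, assuming a session named spark and the ds1 / ds2 from the test above; whatever value is in force when the join executes decides how many partitions its shuffle produces:

spark.conf.set("spark.sql.shuffle.partitions", "765")   // equivalent of the sqlContext.setConf call above
val joined2 = ds1.join(ds2, ds1("time_asc") === ds2("time_asc"), "outer")
joined2.rdd.partitions.size   // expected to be 765, the conf value used by the join's shuffle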

It is a combination of both your guesses.

Assume you have a set of input data with M partitions and you set shuffle partitions to N.

When executing a join, Spark reads your input data in all M partitions and re-shuffles the data, based on the key, into N partitions. Imagine a trivial hash partitioner: the hash function applied to the key pretty much looks like A = hashcode(key) % N, and the data is then re-allocated to the node in charge of handling the A-th partition. Each node can be in charge of handling multiple partitions.
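A minimal sketch of that idea in Scala - not Spark's exact internals, just the non-negative modulo a hash partitioner boils down to:

// Map a key to one of numPartitions shuffle partitions, keeping the result in [0, numPartitions).
def targetPartition(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  rawMod + (if (rawMod < 0) numPartitions else 0)
}

targetPartition("2019-01-01 10:00:00", 200)   // which of the 200 default shuffle partitions this key would land in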

After shuffling, the nodes work to aggregate the data in the partitions they are in charge of. As no additional shuffling needs to be done here, the nodes can produce the output directly.

So in summary, your output will be coalesced to N partitions; however, it is coalesced because it is processed in N partitions, not because Spark applies one additional shuffle stage to specifically repartition your output data to N.
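One way to see this on the test above is the physical plan - a minimal sketch, with the caveat that the exact plan text varies by Spark version:

joined.explain()
// The plan shows an Exchange hashpartitioning(time_asc, N) under each side of the join,
// where N is spark.sql.shuffle.partitions (200 or 765 above), and no extra repartition of the output.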

spark.sql.shuffle.partitions is the parameter which decides the number of partitions when doing shuffles such as joins or aggregations, i.e. where data movement happens across the nodes. The other part, spark.default.parallelism, will be calculated on the basis of your data size and max block size - in HDFS it's 128 MB. So if your job does not do any shuffle it will consider the default parallelism value, or if you are using an RDD you can set it on your own. When shuffling happens it will take 200.

val df = sc.parallelize(List(1,2,3,4,5), 4).toDF()
df.count()   // this will use 4 partitions

val df1 = df
df1.except(df).count   // will generate 200 partitions, having 2 stages
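To check which value applies where, a minimal sketch assuming a spark-shell session with spark and sc in scope:

sc.defaultParallelism                            // spark.default.parallelism, used by plain RDD shuffles and sc.parallelize
spark.conf.get("spark.sql.shuffle.partitions")   // used by DataFrame/Dataset shuffles such as join, except, groupBy; "200" by default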
