
How wide transformations are influenced by shuffle partition config

How do wide transformations actually work with respect to the shuffle partitions configuration?

If I have the following program:

spark.conf.set("spark.sql.shuffle.partitions", "5")
val df = spark
    .read
    .option("inferSchema", "true")
    .option("header", "true")
    .csv("...\input.csv")
df.sort("sal").take(200)

Does this mean the sort would output 5 new partitions (as configured), and that Spark then takes 200 records from those 5 partitions?

As mentioned in the comments, your sample code is not affected, because this sort does not trigger a shuffle. In the plan you will find something like this:

 == Physical Plan ==
 TakeOrderedAndProject (2)
 +- Scan csv  (1)
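You can confirm this yourself by printing the plan. A minimal sketch, assuming an active `SparkSession` named `spark` and the `df` from the question (this requires a Spark environment to run):

```scala
// Sort followed by a row limit collapses into a single
// TakeOrderedAndProject operator: each task keeps only its local
// top 200 rows and the driver merges them, so no Exchange (shuffle)
// appears and spark.sql.shuffle.partitions is never consulted.
df.sort("sal").limit(200).explain()

// df.sort("sal").take(200) is planned the same way.
```

The absence of an `Exchange` node in the output is what tells you the parameter played no role here.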

But when you later do, for example, a join (or any other wide transformation that triggers a shuffle), you can see that during the exchange the value of this parameter is used (check the "number of partitions" row):
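A small sketch that makes the shuffle visible, again assuming an active `SparkSession` named `spark` (the two `range` DataFrames and their `key` column are made up for illustration; broadcast joins and AQE are disabled so the static plan is used):

```scala
// Disable AQE and broadcast joins so the configured value is used as-is.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.shuffle.partitions", "5")

val left  = spark.range(1000).withColumnRenamed("id", "key")
val right = spark.range(1000).withColumnRenamed("id", "key")

val joined = left.join(right, "key")

// The plan contains Exchange hashpartitioning(key, 5):
// both sides are repartitioned into 5 partitions before the join.
joined.explain()

// The join output therefore also has 5 partitions.
println(joined.rdd.getNumPartitions)
```

`getNumPartitions` on the result is a quick way to verify that the shuffled output matches the configured value.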

[Image: sort merge join plan, with the Exchange node showing the number of partitions]

This may not be the case when adaptive query execution (AQE) is enabled; in that situation it may look like this:

[Image: the same plan with AQE enabled]

Now you can see that the value of spark.sql.shuffle.partitions was used at the beginning, but later, due to AQE, Spark changed the plan: on the shuffle read, the number of partitions was changed to 8 (you may also notice that the sort merge join was replaced by a broadcast hash join, which was also done by AQE).
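The settings that allow this runtime repartitioning are a small config fragment (these are standard Spark SQL configuration keys; AQE is on by default in recent Spark versions):

```scala
// With AQE on, Spark uses spark.sql.shuffle.partitions only as the
// initial partition count for the shuffle write; at runtime it may
// coalesce small partitions (or split skewed ones), so the shuffle
// read can end up with a different number of partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```

This is why the configured value of 5 and the observed 8 partitions in the plan above can legitimately differ.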
