
How wide transformations are influenced by shuffle partition config

How do wide transformations actually work with respect to the shuffle partitions configuration?

If I have the following program:

spark.conf.set("spark.sql.shuffle.partitions", "5")
val df = spark
    .read
    .option("inferSchema", "true")
    .option("header", "true")
    .csv("...\input.csv")
df.sort("sal").take(200)

Does this mean the sort would output 5 new partitions (as configured), and that Spark then takes 200 records from those 5 partitions?

As mentioned in the comments, your sample code is not affected, because this sort does not trigger a shuffle. In the plan you will find something like this:

 == Physical Plan ==
 TakeOrderedAndProject (2)
 +- Scan csv  (1)
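You can confirm this yourself by printing the plan. A minimal sketch, assuming an active `SparkSession` named `spark` and the `df` from the question (this requires a Spark environment to run):

```scala
// Sort followed by a row limit collapses into a single
// TakeOrderedAndProject operator: each task keeps only its local
// top 200 rows and the driver merges them, so no Exchange (shuffle)
// appears and spark.sql.shuffle.partitions is never consulted.
df.sort("sal").limit(200).explain()

// df.sort("sal").take(200) is planned the same way.
```

The absence of an `Exchange` node in the output is what tells you the parameter played no role here.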

But when you later do, for example, a join (or any other wide transformation that triggers a shuffle), you can see that during the exchange the value of this parameter is used (check the "number of partitions" row):
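A small sketch that makes the shuffle visible, again assuming an active `SparkSession` named `spark` (the two `range` DataFrames and their `key` column are made up for illustration; broadcast joins and AQE are disabled so the static plan is used):

```scala
// Disable AQE and broadcast joins so the configured value is used as-is.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.shuffle.partitions", "5")

val left  = spark.range(1000).withColumnRenamed("id", "key")
val right = spark.range(1000).withColumnRenamed("id", "key")

val joined = left.join(right, "key")

// The plan contains Exchange hashpartitioning(key, 5):
// both sides are repartitioned into 5 partitions before the join.
joined.explain()

// The join output therefore also has 5 partitions.
println(joined.rdd.getNumPartitions)
```

`getNumPartitions` on the result is a quick way to verify that the shuffled output matches the configured value.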

[Image: sort merge join plan, with the Exchange node showing the number of partitions]

This may not be the case when adaptive query execution (AQE) is enabled; in that situation it may look like this:

[Image: the same plan with AQE enabled]

Now you can see that the value of spark.sql.shuffle.partitions was used at the beginning, but later, due to AQE, Spark changed the plan: on the shuffle read, the number of partitions was changed to 8 (you may also notice that the sort merge join was replaced by a broadcast hash join, which was also done by AQE).
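The settings that allow this runtime repartitioning are a small config fragment (these are standard Spark SQL configuration keys; AQE is on by default in recent Spark versions):

```scala
// With AQE on, Spark uses spark.sql.shuffle.partitions only as the
// initial partition count for the shuffle write; at runtime it may
// coalesce small partitions (or split skewed ones), so the shuffle
// read can end up with a different number of partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```

This is why the configured value of 5 and the observed 8 partitions in the plan above can legitimately differ.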
