Repartition does not affect number of tasks

How do I increase the number of tasks in order to reduce the amount of memory per task needed?

The following very simple example fails:

df = (
    spark
    .read
    .format('delta')
    .load(input_path)
)
df = df.orderBy("contigName", "start", "end")

# write ordered dataset back to disk:
(
    df
    .write
    .format("delta")
    .save(output_path)
)

However, no matter what I do, the Spark UI shows exactly 1300 tasks, and the job crashes after 168 tasks with Job aborted due to stage failure: Total size of serialized results of 168 tasks [...] is bigger than spark.driver.maxResultSize [...].

Further, I tried the following commands:

  • df.orderBy("contigName", "start", "end").limit(5).toPandas() works
  • df.orderBy("contigName", "start", "end").write.format("delta").save(output_path) fails with Total size of serialized results of 118 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
  • df.orderBy("contigName", "start", "end").persist(pyspark.StorageLevel.MEMORY_AND_DISK).limit(5).toPandas() fails as well

EDIT: Thanks to @raphael-roth, I tried the following Spark config:

import os

import glow
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('abc')
    .config("spark.local.dir", os.environ.get("TMP"))
    .config("spark.sql.execution.arrow.enabled", "true")
    .config("spark.sql.shuffle.partitions", "2001")  # target number of shuffle partitions/tasks
    .config("spark.driver.maxResultSize", "4G")
    .getOrCreate()
)
glow.register(spark)
spark

However, this still does not affect the number of tasks.

orderBy will generate spark.sql.shuffle.partitions partitions/tasks (default = 200), no matter how many partitions the input DataFrame has. So increasing this number should solve your problem (unfortunately, it cannot be specified in the method call).
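
As a minimal sketch (assuming an existing SparkSession named spark and the input_path/output_path from the question), the setting can also be changed at runtime before the orderBy runs, without rebuilding the session:

# assumption: `spark`, `input_path` and `output_path` are defined as in the question
spark.conf.set("spark.sql.shuffle.partitions", "2001")  # example value; raise it until individual tasks are small enough

df = spark.read.format("delta").load(input_path)

# the shuffle introduced by orderBy should now use roughly 2001 tasks instead of the default
df.orderBy("contigName", "start", "end").write.format("delta").save(output_path)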

Alternatively, think about using something like repartition(key).sortWithinPartitions(key, attr1, attr2, ...); this will generate only 1 shuffle instead of 2 (a sketch follows below).
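
A minimal sketch of that idea, applied to the columns from the question (the partition count of 800 and the choice of contigName as the repartition key are illustrative assumptions, not values from the original post):

# assumption: `spark`, `input_path` and `output_path` are defined as in the question
df = spark.read.format("delta").load(input_path)

(
    df
    .repartition(800, "contigName")                      # single shuffle with an explicit partition count
    .sortWithinPartitions("contigName", "start", "end")  # sort each partition locally, no second shuffle
    .write
    .format("delta")
    .save(output_path)
)

Note that, unlike orderBy, this sorts rows within each partition but does not produce one global ordering across partitions.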

You can specify the number of partitions to be created in 2 ways (a quick way to verify the resulting count is shown after the list):

  1. from code, wherever you need it (this will most probably trigger shuffling across the network):

df.repartition(800, "hdr_membercode").write.format(table_format).save(full_path, mode=write_mode)

  2. from a spark-submit command-line argument:

--conf "spark.sql.shuffle.partitions=450"
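
If helpful, a quick way to sanity-check the resulting partition count from code (df, the count of 800 and the column name are taken from the examples above; getNumPartitions is the standard way to inspect a DataFrame's underlying partitioning):

# assumption: `df` is the DataFrame from the question
repartitioned = df.repartition(800, "contigName")   # 800 and "contigName" are example values
print(repartitioned.rdd.getNumPartitions())         # prints 800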
