Repartition does not affect number of tasks
How do I increase the number of tasks in order to reduce the amount of memory needed per task?
The following very simple example fails:
df = (
spark
.read
.format('delta')
.load(input_path)
)
df = df.orderBy("contigName", "start", "end")
# write ordered dataset back to disk:
(
df
.write
.format("delta")
.save(output_path)
)
However, no matter what I do, the Spark UI shows exactly 1300 tasks, and the job crashes after 168 tasks with:

Job aborted due to stage failure: Total size of serialized results of 168 tasks [...] is bigger than spark.driver.maxResultSize [...]
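As a back-of-envelope check (plain arithmetic on the figures from the error message above; this is only an average estimate):

```python
# Figures from the error message above: ~4.0 GB of serialized task
# results had accumulated at the driver when task 168 pushed the
# total over spark.driver.maxResultSize.
total_result_bytes = 4.0 * 1024**3   # 4.0 GB
tasks_completed = 168

avg_mb_per_task = total_result_bytes / tasks_completed / 1024**2
print(f"average serialized result per task: ~{avg_mb_per_task:.1f} MB")

# More tasks shrink each task's share of the data, but the total
# collected at the driver stays the same: the maxResultSize check is
# on the sum over all tasks, not on the per-task size.
```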
Further, I tried the following commands:

df.orderBy("contigName", "start", "end").limit(5).toPandas() works.

df.orderBy("contigName", "start", "end").write.format("delta").save(output_path) fails with Total size of serialized results of 118 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB).

df.orderBy("contigName", "start", "end").persist(pyspark.StorageLevel.MEMORY_AND_DISK).limit(5).toPandas() fails as well.

EDIT: Thanks to @raphael-roth I could try the following Spark config:
spark = (
SparkSession.builder
.appName('abc')
.config("spark.local.dir", os.environ.get("TMP"))
.config("spark.sql.execution.arrow.enabled", "true")
.config("spark.sql.shuffle.partitions", "2001")
.config("spark.driver.maxResultSize", "4G")
.getOrCreate()
)
glow.register(spark)
However, this still does not affect the number of tasks.
orderBy will generate spark.sql.shuffle.partitions partitions/tasks (default = 200), no matter how many partitions the input DataFrame has. So increasing this number should solve your problem (unfortunately, it cannot be specified in the method call itself).

Alternatively, think about using something like repartition(key).sortWithinPartitions(key, attr1, attr2, ...); this will only generate one shuffle instead of two.
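The repartition-plus-local-sort pattern can be sketched as a small helper (the function name, the default of 2001 partitions, and the usage below are illustrative, not from the original answer):

```python
def sort_with_single_shuffle(df, key, *attrs, num_partitions=2001):
    """Partition by `key`, then sort rows inside each partition.

    The repartition is the only shuffle; sortWithinPartitions sorts
    each partition locally without shuffling again, unlike a global
    orderBy, which adds a range-partitioning shuffle of its own.
    """
    return (
        df
        .repartition(num_partitions, key)    # one shuffle, keyed by `key`
        .sortWithinPartitions(key, *attrs)   # local per-partition sort
    )

# Usage on the question's DataFrame:
# ordered = sort_with_single_shuffle(df, "contigName", "start", "end")
# ordered.write.format("delta").save(output_path)
```

The trade-off is that the result is only sorted within each partition (all rows with the same key land in the same partition); it is not globally ordered across the whole dataset.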
You can specify the number of partitions to be created in two ways:

By repartitioning before the write:

df.repartition(800, "hdr_membercode").write.format(table_format).save(full_path, mode=write_mode)

From a spark-submit command-line argument:

--conf "spark.sql.shuffle.partitions=450"
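The same setting can also be changed on an existing session at runtime, before the shuffle-triggering action runs (a config sketch; it assumes a live SparkSession named spark):

```python
# Runtime equivalent of the --conf flag above; must be set before
# the job that performs the shuffle is submitted.
spark.conf.set("spark.sql.shuffle.partitions", "450")
```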