Repartition does not affect number of tasks
How can I increase the number of tasks in order to reduce the amount of memory each task needs?
The following, very simple example fails:
df = (
spark
.read
.format('delta')
.load(input_path)
)
df = df.orderBy("contigName", "start", "end")
# write ordered dataset back to disk:
(
df
.write
.format("delta")
.save(output_path)
)
However, no matter what I do, the Spark UI shows me 1300 tasks, and the job crashes after 168 tasks with:
Job aborted due to stage failure: Total size of serialized results of 168 tasks [...] is bigger than spark.driver.maxResultSize [...]
In addition, I tried the following commands:
df.orderBy("contigName", "start", "end").limit(5).toPandas() works,
df.orderBy("contigName", "start", "end").write.format("delta").save(output_path) fails with Total size of serialized results of 118 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB), and
df.orderBy("contigName", "start", "end").persist(pyspark.StorageLevel.MEMORY_AND_DISK).limit(5).toPandas() fails as well.
Edit: Thanks to @raphael-roth, I tried the following Spark config:
spark = (
SparkSession.builder
.appName('abc')
.config("spark.local.dir", os.environ.get("TMP"))
.config("spark.sql.execution.arrow.enabled", "true")
.config("spark.sql.shuffle.partitions", "2001")
.config("spark.driver.maxResultSize", "4G")
.getOrCreate()
)
glow.register(spark)
spark
However, this still does not affect the number of tasks.
orderBy will generate spark.sql.shuffle.partitions partitions/tasks (default = 200), regardless of how many partitions the input DataFrame has. So increasing this number should solve your problem (unfortunately, it cannot be specified in the method call itself).
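The reason more shuffle partitions helps: the same total shuffle volume is split across more tasks, so each task holds less data in memory. A back-of-the-envelope sketch in plain Python, with a hypothetical 400 GiB shuffle and assuming a roughly even key distribution (no skew):

```python
# Hypothetical numbers: a 400 GiB shuffle split across N shuffle partitions.
total_shuffle_bytes = 400 * 1024**3

def bytes_per_task(num_shuffle_partitions: int) -> float:
    # Each shuffle partition becomes one task; assuming no key skew,
    # the data divides evenly across them.
    return total_shuffle_bytes / num_shuffle_partitions

print(bytes_per_task(200) / 1024**3)   # 2.0 GiB per task (the default 200 partitions)
print(bytes_per_task(2001) / 1024**3)  # ~0.2 GiB per task
```

With skewed keys the largest partition dominates, so the real per-task maximum can be much higher than this average.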
Alternatively, consider using something like repartition(key).sortWithinPartitions(key, attr1, attr2, ...), which generates only 1 shuffle instead of 2.
You can specify the number of partitions to create in two ways. In the repartition call itself:
df.repartition(800, "hdr_membercode").write.format(table_format).save(full_path, mode=write_mode)
Or from the spark-submit command-line arguments:
--conf "spark.sql.shuffle.partitions=450"