How can I make more partitions in Spark without causing a shuffle

Original post: 2016-10-21 14:22:04 · Tags: scala, apache-spark

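Basically, my use case is this: in the first stage I can only have a few partitions, because each task runs a C program that needs as much as 10 GB of memory. After that stage, however, I use a RangePartitioner. With so few partitions in the previous stage, the RangePartitioner throws out-of-memory errors when doing the shuffle. It is well known that Spark can throw out-of-memory errors in shuffles when you have too few partitions.

Now, what I want is simply to divide the already existing partitions into more partitions. Basically, the opposite of coalesce in Spark. If I use a partitioner, for example HashPartitioner, it will obviously cause a shuffle, which I'd like to avoid. So, how can I accomplish this?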

1 Answer

Not at this time. You can track the relevant JIRA ticket: https://issues.apache.org/jira/browse/SPARK-5997
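
To make the limitation concrete, here is a minimal sketch (not part of the original answer) that compares the two stock RDD calls. It assumes a local Spark installation. The object name, the app name, and the partition counts 4 and 16 are made up for illustration. `coalesce` with `shuffle = false` cannot raise the partition count, while `repartition` can, but only by shuffling.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; names and sizes are illustrative assumptions.
object SplitWithoutShuffleSketch {

  // Returns (partition count after coalesce without shuffle,
  //          partition count after repartition).
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    // Start with a small number of partitions, as in the question's first stage.
    val rdd = sc.parallelize(1 to 1000, numSlices = 4)

    // Asking for more partitions without a shuffle is ignored:
    // the RDD keeps its current partition count.
    val widened = rdd.coalesce(16, shuffle = false)

    // repartition(n) is coalesce(n, shuffle = true), so it does
    // reach 16 partitions, at the cost of a full shuffle.
    val reshuffled = rdd.repartition(16)

    (widened.getNumPartitions, reshuffled.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("split-without-shuffle-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (narrow, shuffled) = partitionCounts(sc)
      println(s"coalesce(16, shuffle = false): $narrow partitions")
      println(s"repartition(16): $shuffled partitions")
    } finally {
      sc.stop()
    }
  }
}
```

Run locally, the first count should stay at 4 and the second should be 16. That gap is what the question is about: the stock API offers no way to get the second result without the shuffle.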
