How to force a broadcast join in Spark

Question

I have a Spark SQL query like this:

SELECT /*+ BROADCASTJOIN (sbg_published.sk_e2e_web_all_vis) */
       a.* 
FROM 
       sbg_published.sk_e2e_web_all_vis a
LEFT JOIN 
       sbg_published.web_funnel_detail_v4 b
       ON a.col1 = b.col1

I am running this query with spark.sql(). The first table has about 1 million records and the second has 1.5 billion records.

I am trying to force Spark to use a broadcast join, but it picks a sort-merge join instead.

These are the Spark parameters I am using:

"spark.sql.autoBroadcastJoinThreshold" = "4048576000"
"spark.sql.broadcastTimeout" = "100000"
"spark.sql.shuffle.partitions" = 500
"spark.sql.adaptive.enabled" = "true"
"spark.sql.adaptive.coalescePartitions.enabled" = "true"
"spark.sql.adaptive.autoBroadcastJoinThreshold" ="4048576000"
"spark.sql.join.preferSortMergeJoin" = "false"
"spark.shuffle.io.maxRetries"="10"
"spark.dynamicAllocation.enabled"="true"
"spark.shuffle.service.enabled"="true"
"spark.shuffle.compress"="true"
"spark.shuffle.spill.compress"="true"
"spark.driver.maxResultSize"="0"

This is the DAG: [image]

Then I also tried this parameter:

"spark.sql.join.preferSortMergeJoin" = "false"

This made the sort-merge join go away, but Spark then used a shuffle hash join instead.
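
For scale, here is a small sketch of the arithmetic behind the threshold listed above. It only converts the configured byte count and compares it with the 8 GB broadcast ceiling quoted in the answer below. Treating that ceiling as 8 * 2**30 bytes is an assumption, and the constant and function names are made up for illustration.

```python
# Value copied from the settings in the question.
AUTO_BROADCAST_THRESHOLD_BYTES = 4048576000

# Assumption: the "8G" ceiling from the answer, read as 8 GiB.
BROADCAST_CEILING_BYTES = 8 * 2**30


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes."""
    return num_bytes / 2**30


threshold_gib = to_gib(AUTO_BROADCAST_THRESHOLD_BYTES)
below_ceiling = AUTO_BROADCAST_THRESHOLD_BYTES < BROADCAST_CEILING_BYTES

print(f"threshold = {threshold_gib:.2f} GiB, below ceiling: {below_ceiling}")
```

The configured threshold comes out well below the ceiling, which suggests the threshold value by itself is probably not what stops the broadcast here.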

I am using Spark 3.2.

Thanks in advance!

Answer 1

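Besides "spark.sql.autoBroadcastJoinThreshold", Spark also has a hard broadcast size limit of 8 GB. Once a DataFrame exceeds 8 GB, you cannot force Spark to broadcast it. So you can try to work around it in one of these ways: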

Rewrite the SQL so that the small table is the one being broadcast.
Rewrite the SQL using a union.
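
A rough sketch of what the two suggestions could look like for the query in the question. As far as I understand Spark's join selection, a broadcast hash join can only build (broadcast) the side that is not preserved by an outer join, so a hint on the left table of a LEFT JOIN may be ignored no matter how high the threshold is. The lookup table below encodes that rule as I understand it, the SQL string is an untested rewrite that uses the alias in the hint, and the two Python helpers are hypothetical stand-ins that only check the rewrite returns the same rows as the LEFT JOIN on toy data.

```python
from collections import Counter

# Sides that can be the build (broadcast) side of a broadcast hash join,
# per join type. Assumption: this mirrors Spark's join selection rules.
BROADCASTABLE_SIDES = {
    "inner": {"left", "right"},
    "left_outer": {"right"},
    "right_outer": {"left"},
    "left_semi": {"right"},
    "left_anti": {"right"},
    "full_outer": set(),
}

# Untested Spark SQL sketch of the union rewrite. Table names are taken
# from the question; matched_keys and the alias m are made up here.
UNION_REWRITE_SQL = """
WITH matched_keys AS (
  -- keys of b that also occur in a; a is on the broadcastable right side
  SELECT /*+ BROADCAST(a) */ DISTINCT b.col1
  FROM sbg_published.web_funnel_detail_v4 b
  LEFT SEMI JOIN sbg_published.sk_e2e_web_all_vis a ON a.col1 = b.col1
)
-- rows of a that have a match: an inner join can broadcast either side
SELECT /*+ BROADCAST(a) */ a.*
FROM sbg_published.web_funnel_detail_v4 b
JOIN sbg_published.sk_e2e_web_all_vis a ON a.col1 = b.col1
UNION ALL
-- rows of a without a match: the small key set is on the right side
SELECT /*+ BROADCAST(m) */ a.*
FROM sbg_published.sk_e2e_web_all_vis a
LEFT ANTI JOIN matched_keys m ON a.col1 = m.col1
"""


def left_join(a_rows, b_keys):
    """Rows of SELECT a.* FROM a LEFT JOIN b ON a.col1 = b.col1."""
    out = []
    for row in a_rows:
        # NULL keys never match, as in SQL equality.
        matches = sum(1 for k in b_keys if k is not None and k == row[1])
        out.extend([row] * max(matches, 1))
    return out


def union_rewrite(a_rows, b_keys):
    """Same rows, built as inner join plus the unmatched rows of a."""
    a_keys = {row[1] for row in a_rows if row[1] is not None}
    inner = [row for k in b_keys if k is not None
             for row in a_rows if row[1] == k]
    matched = {k for k in b_keys if k in a_keys}  # semi join + DISTINCT
    unmatched = [row for row in a_rows if row[1] not in matched]  # anti join
    return inner + unmatched


# Toy data: rows of a are (id, col1); b is reduced to its col1 values.
sample_a = [(1, "x"), (2, "y"), (3, None), (4, "z")]
sample_b = ["x", "x", "z", "q", None]
same_rows = (Counter(left_join(sample_a, sample_b))
             == Counter(union_rewrite(sample_a, sample_b)))
print(same_rows)
```

The trade-off is that the large table is scanned twice, so it is worth checking the physical plan with EXPLAIN before and after the rewrite to confirm that the joins really switch to broadcast and that the extra scan pays off.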

Solution 1
0 2022-08-18 07:29:11
