Apache Spark: broadcast join behavior: filtering of joined tables and temp tables

Question

I need to join two tables in Spark. Instead of joining the two tables in full, I first filter out part of the second table:

spark.sql("select * from a join b on a.key=b.key where b.value='xxx' ")

I want to use a broadcast join in this case.

Spark has a parameter that defines the maximum table size for a broadcast join, spark.sql.autoBroadcastJoinThreshold:

Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run. http://spark.apache.org/docs/2.4.0/sql-performance-tuning.html
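As a rough mental model of the setting quoted above, the comparison can be mimicked in plain Python. This is only a sketch and not Spark's code: `can_auto_broadcast` and its arguments are hypothetical names, and the size passed in stands for whatever estimate Spark holds for that side of the join. The 10485760-byte default is the value listed in the Spark 2.4 tuning docs.

```python
# Documented default of spark.sql.autoBroadcastJoinThreshold: 10485760 bytes (10 MB).
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def can_auto_broadcast(estimated_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Simplified model (an assumption, not Spark's implementation).

    A join side qualifies for automatic broadcast when its estimated size
    is non-negative and does not exceed the threshold. With a threshold of
    -1 no size can satisfy both conditions, so broadcasting is disabled.
    """
    return 0 <= estimated_size_bytes <= threshold_bytes


print(can_auto_broadcast(5 * 1024 * 1024))            # True: 5 MB fits under 10 MB
print(can_auto_broadcast(50 * 1024 * 1024))           # False: 50 MB is too large
print(can_auto_broadcast(1024, threshold_bytes=-1))   # False: -1 disables broadcast
```

The point of the model is that the decision depends on the size estimate Spark has for a table, which is why the availability of statistics matters.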

I have the following questions about this setup:

1. Which table size will Spark compare with the value of autoBroadcastJoinThreshold: the FULL size, or the size AFTER applying the where clause?
2. I am assuming that Spark will apply the where clause BEFORE broadcasting, correct?
3. The doc says I need to run Hive's ANALYZE TABLE command beforehand. How will that work when I am using a temp view as a table? As far as I understand, I cannot run ANALYZE TABLE against a Spark temp view created via dataFrame.createOrReplaceTempView("b"). Can I broadcast the contents of a temp view?

Answer 1

Your understanding in point 2 is correct. You cannot analyze a temp table in Spark.

If you want to take control and specify which dataframe to broadcast, instead of letting Spark decide, you can use the snippet below:

from pyspark.sql import functions as F
df = df1.join(F.broadcast(df2), df1.some_col == df2.some_col, "left")

Answer 2

I went ahead and ran some small experiments to answer your first question.

Question 1:

Created a dataframe a with 3 rows [key, df_a_column]
Created a dataframe b with 10 rows [key, value]
Ran: spark.sql("SELECT * FROM a JOIN b ON a.key = b.key").explain()

== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildLeft, false
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#168]
:  +- LocalTableScan [key#122, df_a_column#123]
+- *(1) LocalTableScan [key#111, value#112]

As expected, the smaller dataframe a, with 3 rows, is broadcast.

Ran: spark.sql("SELECT * FROM a JOIN b ON a.key = b.key where b.value=\"bat\"").explain()

== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#152]
   +- LocalTableScan [key#111, value#112]

Here you can see that dataframe b is broadcast. This indicates that, in this experiment, Spark chose which side to broadcast based on the size AFTER applying the where clause.

Question 2:

Yes, you are right. The previous output shows that Spark applies the where clause first.

Question 3: No, you cannot analyze a temp view, but you can still broadcast it by giving Spark a hint, even in SQL.

Example: spark.sql("SELECT /*+ BROADCAST(b) */ * FROM a JOIN b ON a.key = b.key")

If you look at the explain output now:

== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#184]
   +- LocalTableScan [key#111, value#112]

Now dataframe b is broadcast even though it has 10 rows. In question 1, without the hint, a was broadcast.

Note: the broadcast hint in Spark SQL is available from version 2.2.

Tips for reading the physical plan:

Identify each dataframe from the column list in LocalTableScan [list of columns].
The dataframe that appears in the subtree under BroadcastExchange is the one being broadcast.
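The two tips above can be applied mechanically to the text printed by explain(). The following is a minimal sketch in plain Python, assuming the plan layout shown in the outputs above (tree prefixes ":- " and "+- ", scan nodes whose name contains "Scan"). `broadcast_side_columns` is a hypothetical helper, not part of Spark, and plans from other Spark versions or data sources may need different parsing.

```python
import re

# Tree prefix printed by explain(): spaces/colons followed by "+- " or ":- ".
_PREFIX = re.compile(r"^([ :]*)[+:]- ")
# Whole-stage-codegen marker such as "*(1) " in front of a node name.
_CODEGEN = re.compile(r"^\*\(\d+\) ")


def _parse(line):
    """Return (depth, node_text) for one line of a physical plan."""
    m = _PREFIX.match(line)
    depth = len(m.group(1)) if m else -1      # the root line has no prefix
    text = line[m.end():] if m else line
    return depth, _CODEGEN.sub("", text.strip())


def broadcast_side_columns(plan_text):
    """Collect the column names of every scan under a BroadcastExchange node."""
    columns, exchange_depth = [], None
    for line in plan_text.splitlines():
        if not line.strip() or line.startswith("=="):
            continue                           # skip blanks and the plan header
        depth, node = _parse(line)
        if exchange_depth is not None and depth <= exchange_depth:
            exchange_depth = None              # left the broadcast subtree
        if node.startswith("BroadcastExchange"):
            exchange_depth = depth
        elif exchange_depth is not None and "Scan" in node:
            inside = re.search(r"\[(.*?)\]", node)
            if inside:
                # "key#111" -> "key": drop Spark's expression ids
                columns += [c.split("#")[0].strip() for c in inside.group(1).split(",")]
    return columns


# Plan copied from the filtered query above.
PLAN_WITH_FILTER = """== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#152]
   +- LocalTableScan [key#111, value#112]"""

print(broadcast_side_columns(PLAN_WITH_FILTER))  # ['key', 'value']
```

The columns key and value belong to dataframe b, which matches the reading of the plan given in the answer.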

Answer 1
2 2021-07-08 06:34:23

Answer 2
1 Accepted 2021-07-08 07:56:55
