I need to join two tables in Spark. Instead of joining the two tables in full, I first filter the second table:
spark.sql("select * from a join b on a.key=b.key where b.value='xxx' ")
I want to use a broadcast join in this case.
Spark has a parameter that defines the maximum table size for a broadcast join, spark.sql.autoBroadcastJoinThreshold:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run. Source: http://spark.apache.org/docs/2.4.0/sql-performance-tuning.html
I have the following questions about this setup:
Your understanding in option 2 is correct: you cannot analyze a temp table in Spark.
If you want to decide yourself which dataframe is broadcast, instead of leaving the choice to Spark, you can use the snippet below:
from pyspark.sql import functions as F
df = df1.join(F.broadcast(df2), df1.some_col == df2.some_col, "left")
I went ahead and did some small experiments to answer your 1st question.
Question 1 :
a with 3 rows [key, df_a_column]
b with 10 rows [key, value]
spark.sql("SELECT * FROM a JOIN b ON a.key = b.key").explain()
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildLeft, false
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#168]
:  +- LocalTableScan [key#122, df_a_column#123]
+- *(1) LocalTableScan [key#111, value#112]
As expected, the smaller dataframe a, with 3 rows, is broadcast.
spark.sql("SELECT * FROM a JOIN b ON a.key = b.key where b.value=\"bat\"").explain()
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#152]
   +- LocalTableScan [key#111, value#112]
Here you can see that dataframe b is broadcast, meaning Spark evaluates the size after applying the where clause when choosing which side to broadcast.
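To make the reasoning behind the two plans above concrete, here is a rough sketch in plain Python. It is my own simplified model of the inner-join case, not Spark's planner code: the function names and the byte sizes are made up for illustration, and the only values taken from the documentation are the 10 MB default threshold and the meaning of -1.

```python
DEFAULT_THRESHOLD = 10 * 1024 * 1024  # 10 MB, the documented default


def can_broadcast(size_in_bytes, threshold=DEFAULT_THRESHOLD):
    """A side qualifies when its estimated size fits under the threshold."""
    # A threshold of -1 can never be met, which disables auto-broadcast.
    return 0 <= size_in_bytes <= threshold


def choose_build_side(left_size, right_size, threshold=DEFAULT_THRESHOLD):
    """Return 'BuildLeft', 'BuildRight', or None (no broadcast join)."""
    left_ok = can_broadcast(left_size, threshold)
    right_ok = can_broadcast(right_size, threshold)
    if left_ok and right_ok:
        # Both fit: prefer the smaller side (ties go to the right).
        return "BuildRight" if right_size <= left_size else "BuildLeft"
    if right_ok:
        return "BuildRight"
    if left_ok:
        return "BuildLeft"
    return None


# Hypothetical size estimates in bytes: a (3 rows), b (10 rows), b after the filter.
print(choose_build_side(60, 200))      # BuildLeft
print(choose_build_side(60, 20))       # BuildRight
print(choose_build_side(60, 20, -1))   # None
```

The point is only that the size used for the comparison is the estimate after the filter, so filtering b can flip the build side from a to b.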
Question 2 :
Yes, you are right. It is evident from the previous output that Spark applies the where clause first.
Question 3 : No, you cannot analyze a temp view, but you can still broadcast it by giving Spark a hint, even in SQL.
Example : spark.sql("SELECT /*+ BROADCAST(b) */ * FROM a JOIN b ON a.key = b.key")
And if you look at the explain output now:
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#184]
   +- LocalTableScan [key#111, value#112]
Now you can see that dataframe b is broadcast even though it has 10 rows. In question 1, without the hint, a was broadcast.
Note: the broadcast hint in Spark SQL is available as of Spark 2.2.
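Tips to understand the physical plan:
LocalTableScan [list of columns]: identify each dataframe by the columns listed here.
BroadcastExchange: the dataframe in the subtree under this node is the one being broadcast.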
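If you want to check this programmatically, the two tips can be sketched as a small text parser. This is only a heuristic that assumes simple plans shaped like the ones above (a single BroadcastHashJoin, with the broadcast scan printed right after BroadcastExchange); the helper names are hypothetical and not part of any Spark API.

```python
import re

# Matches the join node line and captures the build side, e.g.
# "*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false"
_JOIN_RE = re.compile(r"BroadcastHashJoin\b.*?\b(BuildLeft|BuildRight)\b")
# Matches a scan line and captures its column list, e.g. "key#111, value#112"
_SCAN_RE = re.compile(r"LocalTableScan\s*\[([^\]]*)\]")


def broadcast_side(plan_text):
    """Return 'left', 'right', or None if no BroadcastHashJoin is found."""
    match = _JOIN_RE.search(plan_text)
    if match is None:
        return None
    return "left" if match.group(1) == "BuildLeft" else "right"


def broadcast_columns(plan_text):
    """Return column names (ids stripped) of the scan under BroadcastExchange."""
    lines = plan_text.splitlines()
    for i, line in enumerate(lines):
        if "BroadcastExchange" in line:
            # The broadcast dataframe is the first scan listed after this node.
            for later in lines[i + 1:]:
                scan = _SCAN_RE.search(later)
                if scan:
                    return [c.strip().split("#")[0] for c in scan.group(1).split(",")]
    return []


# Plan text copied from the hinted query above.
plan = """== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#184]
   +- LocalTableScan [key#111, value#112]"""

print(broadcast_side(plan))      # right
print(broadcast_columns(plan))   # ['key', 'value']
```

Here the columns [key, value] identify b, which matches the BuildRight side of the join.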