
Best way to join multiple small tables with a big table in Spark SQL

I'm joining multiple tables using Spark SQL. One of the tables is very big and the others are small (10-20 records). What I really want is to replace values in the biggest table using the other tables, which contain key-value pairs.

i.e., Bigtable:

| Col 1 | Col 2 | Col 3 | Col 4 | ....
--------------------------------------
| A1    | B1    | C1    | D1    | ....
| A2    | B1    | C2    | D2    | ....
| A1    | B1    | C3    | D2    | ....
| A2    | B2    | C3    | D1    | ....
| A1    | B2    | C2    | D1    | ....
.
.
.
.
.

Table2:

| Col 1 | Col 2 
----------------
| A1    | 1a    
| A2    | 2a    

Table3:

| Col 1 | Col 2 
----------------
| B1    | 1b    
| B2    | 2b  

Table4:

| Col 1 | Col 2 
----------------
| C1    | 1c    
| C2    | 2c  
| C3    | 3c

Table5:

| Col 1 | Col 2 
----------------
| D1    | 1d    
| D2    | 2d  

The expected table is:

| Col 1 | Col 2 | Col 3 | Col 4 | ....
--------------------------------------
| 1a    | 1b    | 1c    | 1d    | ....
| 2a    | 1b    | 2c    | 2d    | ....
| 1a    | 1b    | 3c    | 2d    | ....
| 2a    | 2b    | 3c    | 1d    | ....
| 1a    | 2b    | 2c    | 1d    | ....
.
.
.
.
.

My question is: which is the best way to join the tables? (Assume there are 100 or more small tables.)

1) Collect the small DataFrames, transform them into maps, broadcast the maps, and transform the big DataFrame in a single step:

bigdf.transform(ds => ds.map(row => (small1.get(row.col1), .....)))
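Spelled out, option 1 might look like the following sketch. The DataFrame names (`bigDF`, `small1DF`, `small2DF`) and the assumption that every column is a `String` are illustrative, not from the original question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-lookup").getOrCreate()
import spark.implicits._

// Collect each small table onto the driver as a Map and broadcast it.
val small1Map = spark.sparkContext.broadcast(
  small1DF.as[(String, String)].collect().toMap)
val small2Map = spark.sparkContext.broadcast(
  small2DF.as[(String, String)].collect().toMap)

// Replace values in the big table in a single pass, keeping the
// original value when a key is missing from the lookup map.
val replaced = bigDF.as[(String, String, String, String)].map {
  case (c1, c2, c3, c4) =>
    (small1Map.value.getOrElse(c1, c1),
     small2Map.value.getOrElse(c2, c2),
     c3, c4)
}
```

This trades the optimizer's join planning for a single hand-written pass over the big table.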

2) Broadcast the tables and join them using the SQL select syntax:

spark.sql("
       select * 
       from bigtable
       left join small1 using(id1) 
       left join small2 using(id2)")

3) Broadcast the tables and chain multiple joins:

bigtable.join(broadcast(small1), bigtable("col1") === small1("col1")).join(...)

Thanks in advance.

You might do:

  1. broadcast all the small tables (done automatically by setting spark.sql.autoBroadcastJoinThreshold slightly above the size of the small tables; note the threshold is in bytes, not rows)
  2. run a SQL query that joins the big table, such as:

    val df = spark.sql(" select * from bigtable left join small1 using(id1) left join small2 using(id2)")
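As a minimal sketch of step 1 (the 10 MB figure is illustrative, not a recommendation from the answer):

```scala
// Auto-broadcast any table whose estimated size is below 10 MB.
// The threshold is specified in bytes; setting it to -1 disables
// auto-broadcasting entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

val df = spark.sql("""
  select *
  from bigtable
  left join small1 using(id1)
  left join small2 using(id2)
""")
```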

EDIT: On choosing between the SQL and the Spark "DataFrame" syntax: the SQL syntax is more readable and less verbose than the Spark syntax (from a database user's perspective). From a developer's perspective, the DataFrame syntax might be more readable.

The main advantage of using the typed "Dataset" syntax is that the compiler can catch some errors. Anything expressed as a string, such as SQL text or a column name (col("mycol")), is only checked at run time.
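That trade-off can be sketched as follows (the case class, path, and column names are illustrative):

```scala
import org.apache.spark.sql.functions.col

case class Record(id1: String, id2: String)
import spark.implicits._

val ds = spark.read.parquet("/path/to/bigtable").as[Record]

// Compile-time error: value idX is not a member of Record.
// ds.map(r => r.idX)

// Compiles fine, but fails only at run time with an AnalysisException
// if the column does not exist:
// ds.select(col("id_typo"))
```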

If the data in your small tables is below the threshold size and the physical files backing them are in Parquet format, then Spark will broadcast the small tables automatically. But if you are reading the data from other data sources (e.g., over JDBC from a SQL database such as PostgreSQL), Spark sometimes does not broadcast the table automatically.

If you know that the tables are small and their size is not expected to grow (as with lookup tables), you can explicitly broadcast the DataFrame or table, and in this way you can efficiently join the larger table with the small tables.

You can verify that the small table is being broadcast by running the explain command on the DataFrame, or from the Spark UI.
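Putting those two points together (the DataFrame names and join column are illustrative): broadcast the lookup table explicitly, then confirm the plan actually uses a broadcast join.

```scala
import org.apache.spark.sql.functions.broadcast

// Explicitly mark the small lookup table for broadcasting.
val joined = bigDF.join(broadcast(smallDF), Seq("id1"), "left")

// The printed physical plan should contain a BroadcastHashJoin node
// (also visible in the SQL tab of the Spark UI).
joined.explain()
```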

The best way, as already written in the other answers, is to broadcast all the small tables. This can also be done in SQL alone, with a hint:

val df = spark.sql("""
    select /*+ BROADCAST(t2, t3) */
        * 
    from bigtable t1
        left join small1 t2 using(id1) 
        left join small2 t3 using(id2)
""")
