
Spark SQL query: join on same column name

I am writing a join query for two dataframes. I have to perform the join on a column that has the same name in both dataframes. How can I write this as a SQL query?

// in spark-shell the session implicits are already in scope; in a
// standalone application you need this import for toDF:
import spark.implicits._

val df1 = Seq((1, "har"), (2, "ron"), (3, "fred")).toDF("ID", "NAME")
val df2 = Seq(("har", "HARRY"), ("ron", "RONALD")).toDF("NAME", "ACTUALNAME")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")

I know we can do val df3 = df1.join(df2, Seq("NAME")), where NAME is the common column. In that scenario df3 will have only the columns ID, NAME, ACTUALNAME.
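For context, a minimal sketch of that DataFrame-API variant (the "left_outer" argument is my addition so it mirrors the SQL below; join defaults to an inner join):

val df3 = df1.join(df2, Seq("NAME"), "left_outer")
df3.printSchema()  // NAME, ID, ACTUALNAME -- the join column appears only once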

If we do it in SQL, the query will be select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME. The output dataframe will then have the columns ID, NAME, NAME, ACTUALNAME. How can I remove the extra NAME column that came from df2?
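A quick sketch of the problem, using the tables registered above; the duplicate NAME column makes any unqualified reference to it ambiguous:

val joined = spark.sql(
  "select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME")
joined.printSchema()  // ID, NAME, NAME, ACTUALNAME -- NAME appears twice
// joined.select("NAME") now fails with an AnalysisException saying the
// reference 'NAME' is ambiguous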

This does not work either: spark.sql("select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").drop(df2("NAME"))

Is there a cleaner way to do this? Renaming the df2 columns is the last option, which I don't want to use. I have scenarios where writing SQL queries is easier than using the DataFrame API, so I am looking for Spark SQL-specific answers only.

Try this: you can use col() to refer to the qualified column and drop it.

scala> spark.sql("select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").drop(col("table2.NAME")).show()
+---+----+----------+
| ID|NAME|ACTUALNAME|
+---+----+----------+
|  1| har|     HARRY|
|  2| ron|    RONALD|
|  3|fred|      null|
+---+----+----------+

If you do not apply an alias to each dataframe, you'll hit an error after you create the joined dataframe: with two columns named the same thing, referencing either one fails because Spark cannot tell which of the duplicates you mean (an "ambiguous reference" error). In SQL Server and some other engines, such a query would either be rejected or the duplicate field would automatically get a prefix or suffix appended to its name.
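A sketch of what that aliasing looks like with the df1/df2 from the question (t1 and t2 are arbitrary alias names chosen for illustration):

import org.apache.spark.sql.functions.col

// Alias each side so the two NAME columns stay distinguishable after the join.
val joined = df1.as("t1")
  .join(df2.as("t2"), col("t1.NAME") === col("t2.NAME"), "left_outer")

joined.select(col("t1.NAME")).show()  // unambiguous reference
joined.drop(col("t2.NAME")).show()    // drops only table2's copy of NAME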

We can select just the required fields in the SQL query, like below (note that once the tables are aliased, the ON clause must use the aliases too):

spark.sql("select A.ID,A.NAME,B.ACTUALNAME from table1 A LEFT OUTER JOIN table2 B ON table1.NAME = table2.NAME").show()

This is mostly an academic exercise, but you can also do it without dropping any columns by switching on Spark SQL's ability to interpret regular expressions in quoted identifiers, an ability inherited from Hive SQL. You need to set spark.sql.parser.quotedRegexColumnNames to true when building the Spark session for this to work.

$ spark-shell --master "local[*]" --conf spark.sql.parser.quotedRegexColumnNames=true
...
scala> spark.sql("select table1.*, table2.`^(?!NAME$).*$` from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").show()
+---+----+----------+
| ID|NAME|ACTUALNAME|
+---+----+----------+
|  1| har|     HARRY|
|  2| ron|    RONALD|
|  3|fred|      null|
+---+----+----------+

Here

table2.`^(?!NAME$).*$`

resolves to all columns of table2 except NAME. Any valid Java regular expression should work.
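As an aside, this flag appears to be an ordinary runtime SQL conf in recent Spark versions, so it should also be possible to enable it in an existing session rather than at shell startup; treat the following as a sketch, not a guarantee for every version:

// Assumes your Spark version allows setting this parser flag at runtime.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")

// Backquoted identifiers are now parsed as Java regexes; for example,
// `.*NAME.*` expands to every table2 column whose name contains NAME.
spark.sql("select table2.`.*NAME.*` from table2").show()

Be aware that while the flag is on, every backquoted identifier is treated as a regular expression, which can change the meaning of queries that use backticks for ordinary column names.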
