I am writing a join query for two dataframes. I have to perform the join on a column which has the same name in both dataframes. How can I write this query?
var df1 = Seq((1,"har"),(2,"ron"),(3,"fred")).toDF("ID", "NAME")
var df2 = Seq(("har", "HARRY"),("ron", "RONALD")).toDF("NAME", "ACTUALNAME")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
I know we can do df3 = df1.join(df2, Seq("NAME")), where NAME is the common column. In this scenario df3 will have only the columns ID, NAME, ACTUALNAME.
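For the left-outer case, the same Seq-based join accepts a join-type argument, which keeps the single shared NAME column while preserving unmatched rows from df1. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

// Local session purely for illustration.
val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "har"), (2, "ron"), (3, "fred")).toDF("ID", "NAME")
val df2 = Seq(("har", "HARRY"), ("ron", "RONALD")).toDF("NAME", "ACTUALNAME")

// Passing the join columns as a Seq deduplicates them in the result;
// the third argument selects the join type.
val df3 = df1.join(df2, Seq("NAME"), "left_outer")
df3.show()
// Columns: NAME, ID, ACTUALNAME — only one NAME column
```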
If we do it in SQL, the query will be select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME. For this, the output dataframe will have the columns ID, NAME, NAME, ACTUALNAME. How can I remove the extra NAME column which came from df2?
This does not work either: spark.sql("select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").drop(df2("NAME"))
Is there a cleaner way to do this? Renaming the df2 columns is the last option, which I don't want to use. I have a scenario where creating SQL queries is easier than using the dataframe API, so I am looking for Spark SQL specific answers only.
Try this: you can use col() to refer to the column you want to drop.
scala> spark.sql("select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").drop(col("table2.NAME")).show()
+---+----+----------+
| ID|NAME|ACTUALNAME|
+---+----+----------+
| 1| har| HARRY|
| 2| ron| RONALD|
| 3|fred| null|
+---+----+----------+
If you do not apply an alias to the dataframes, you'll receive an error after you create your joined dataframe: with two columns named the same thing, referencing one of the duplicate columns fails because Spark doesn't know which one you selected (the reference is ambiguous). In SQL Server and other engines, the query either wouldn't go through at all, or the engine would automatically append a prefix or suffix to the duplicate field name.
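A minimal sketch of the aliasing approach described above, using the df1 and df2 from the question (the alias names t1 and t2 are arbitrary choices for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("alias-demo").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "har"), (2, "ron"), (3, "fred")).toDF("ID", "NAME")
val df2 = Seq(("har", "HARRY"), ("ron", "RONALD")).toDF("NAME", "ACTUALNAME")

// Aliasing each side lets you reference the duplicate column
// unambiguously, so the extra NAME can be dropped after the join.
val joined = df1.as("t1")
  .join(df2.as("t2"), $"t1.NAME" === $"t2.NAME", "left_outer")
  .drop($"t2.NAME")

joined.show()
```

After the drop, only one NAME column remains, so later selects on it are no longer ambiguous.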
We can select only the required fields in the SQL query, like below. Note that once the tables are aliased as A and B, the join condition must also use the aliases:
spark.sql("select A.ID, A.NAME, B.ACTUALNAME from table1 A LEFT OUTER JOIN table2 B ON A.NAME = B.NAME").show()
This is mostly an academic exercise, but you can also do it without dropping any columns by switching on the ability of Spark SQL to interpret regular expressions in quoted identifiers, an ability inherited from Hive SQL. You need to set spark.sql.parser.quotedRegexColumnNames to true when building the Spark session for this to work.
$ spark-shell --master "local[*]" --conf spark.sql.parser.quotedRegexColumnNames=true
...
scala> spark.sql("select table1.*, table2.`^(?!NAME$).*$` from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").show()
+---+----+----------+
| ID|NAME|ACTUALNAME|
+---+----+----------+
| 1| har| HARRY|
| 2| ron| RONALD|
| 3|fred| null|
+---+----+----------+
Here table2.`^(?!NAME$).*$` resolves to all columns of table2 except NAME. Any valid Java regular expression should work.
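As a sketch of an alternative to passing the flag on the spark-shell command line (assuming a Spark version where this SQL conf is runtime-configurable, which it is as a session-level setting), the option can also be flipped on an existing session before running the query:

```scala
// Assumption: spark is an existing SparkSession with table1/table2
// registered as temp views, as in the question.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")

val result = spark.sql(
  """select table1.*, table2.`^(?!NAME$).*$`
    |from table1 LEFT OUTER JOIN table2
    |ON table1.NAME = table2.NAME""".stripMargin)
result.show()
```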