
Spark SQL QUERY join on Same column name


I am writing a join query for two dataframes. I have to join on a column that has the same name in both dataframes. How can I write this as a query?

var df1 = Seq((1,"har"),(2,"ron"),(3,"fred")).toDF("ID", "NAME")
var df2 = Seq(("har", "HARRY"),("ron", "RONALD")).toDF("NAME", "ACTUALNAME")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")

I know we can do df3 = df1.join(df2, Seq("NAME")), where NAME is the common column. In this scenario df3 will have only ID, NAME, ACTUALNAME.

If we do it from SQL, the query will be select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME. The output dataframe will then have the columns ID, NAME, NAME, ACTUALNAME. How can I remove the extra NAME column that came from df2?
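For reference, the DataFrame-API equivalent of that left outer join that keeps only a single NAME column is the three-argument `join` overload (standard Spark API; note that with this form the join column is emitted first in the result):

```scala
// Joining on Seq("NAME") deduplicates the join column; the third
// argument selects the join type, matching the SQL LEFT OUTER JOIN.
val df3 = df1.join(df2, Seq("NAME"), "left_outer")
df3.show()
// Columns: NAME, ID, ACTUALNAME — NAME appears only once.
```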

This does not work either: spark.sql("select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").drop(df2("NAME"))

Is there a cleaner way to do this? Renaming the df2 columns is a last resort that I would rather avoid. I have scenarios where writing SQL queries is easier than using dataframes, so I am looking for Spark SQL-specific answers only.

Try this: you can use col() to refer to the column you want to drop.

scala> spark.sql("select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").drop(col("table2.NAME")).show()
+---+----+----------+
| ID|NAME|ACTUALNAME|
+---+----+----------+
|  1| har|     HARRY|
|  2| ron|    RONALD|
|  3|fred|      null|
+---+----+----------+

If you do not apply an alias to the dataframe, you'll receive an error after you create your joined dataframe. With two columns named the same thing, referencing one of the duplicate columns returns an error that essentially says Spark doesn't know which one you selected (it is ambiguous). In SQL Server and other databases, the SQL engine either wouldn't let that query go through or would automatically append a prefix or suffix to the duplicated field name.

We can select only the required fields in the SQL query, like below:

spark.sql("select A.ID, A.NAME, B.ACTUALNAME from table1 A LEFT OUTER JOIN table2 B ON A.NAME = B.NAME").show()
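Another pure-SQL option, if your Spark version supports it, is the standard USING clause. Spark SQL follows the SQL-standard behavior here: the join column is coalesced, so SELECT * returns it only once (a sketch, assuming the same table1/table2 temp views as above):

```scala
// USING (NAME) joins on the shared column and deduplicates it,
// so the result has columns NAME, ID, ACTUALNAME.
spark.sql(
  "select * from table1 LEFT OUTER JOIN table2 USING (NAME)"
).show()
```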

This is mostly an academic exercise, but you can also do it without dropping columns by switching on the ability of Spark SQL to interpret regular expressions in quoted (backtick) identifiers, an ability inherited from Hive SQL. You need to set spark.sql.parser.quotedRegexColumnNames to true when building the Spark context for this to work.

$ spark-shell --master "local[*]" --conf spark.sql.parser.quotedRegexColumnNames=true
...
scala> spark.sql("select table1.*, table2.`^(?!NAME$).*$` from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").show()
+---+----+----------+
| ID|NAME|ACTUALNAME|
+---+----+----------+
|  1| har|     HARRY|
|  2| ron|    RONALD|
|  3|fred|      null|
+---+----+----------+

Here,

table2.`^(?!NAME$).*$`

resolves to all columns of table2 except NAME. Any valid Java regular expression should work.
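If restarting the shell with a --conf flag is inconvenient, the same option is a runtime SQL conf in recent Spark versions (an assumption worth verifying on your version), so it can be toggled per session:

```scala
// Assumes spark.sql.parser.quotedRegexColumnNames is runtime-settable
// on your Spark version; enable it, then use a regex column identifier.
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
spark.sql(
  "select table1.*, table2.`^(?!NAME$).*$` from table1 " +
  "LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME"
).show()
```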
