
Left outer complex join of Spark DataFrames using Seq("key") syntax

I need to convert the SQL join below to use DataFrames. The problem is that the result contains duplicate "key" columns.

val result_sql = sparkSession.sql(" select * from TAB_A a left outer join TAB_B b on a.key = b.key AND a.e_date between b.start_date and b.end_date ")

result_sql.printSchema()

root
|-- key: string (nullable = true)
|-- key: string (nullable = true)
|-- VAL: double (nullable = true)

So I tried this, but ended up with the same duplicate "key" column:

val result_df = TAB_A.join(TAB_B, TAB_A.col("key") === TAB_B.col("key")
                             && TAB_A.col("e_date").between(TAB_B.col("start_date"), TAB_B.col("end_date")),
                        "left_outer")

root
|-- key: string (nullable = true)
|-- key: string (nullable = true)
|-- VAL: double (nullable = true)

Then I tried using Seq, but I was unable to express the complex join condition this way and got compile errors:

val result_df = TAB_A.join(TAB_B, Seq("key") && TAB_A.col("e_date").between(TAB_B.col("start_date"), TAB_B.col("end_date")),
                        "left_outer")

Expected Schema :

root
|-- key: string (nullable = true)
|-- VAL: double (nullable = true)

Is there a clean way to implement the above logic without duplicate columns?

Note: I am looking for a solution using Spark DataFrames rather than a spark.sql query.

The problem with the SQL is that the result has two columns named key, one from each of the two joined tables.

Solution #1: assign different names to the key columns.
For example, rename the left table's key column to k1
and the right table's key column to k2.
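A minimal runnable sketch of this rename approach, using a local SparkSession; the sample rows and session setup here are illustrative, not from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Toy data matching the question's schema (values are made up for illustration).
val spark = SparkSession.builder.master("local[*]").appName("rename-keys").getOrCreate()
import spark.implicits._

val TAB_A = Seq(("a", "2020-01-15")).toDF("key", "e_date")
val TAB_B = Seq(("a", "2020-01-01", "2020-01-31", 1.0)).toDF("key", "start_date", "end_date", "VAL")

// Rename both key columns before joining, so neither output column is ambiguous.
val renamed = TAB_A.withColumnRenamed("key", "k1")
  .join(TAB_B.withColumnRenamed("key", "k2"),
    col("k1") === col("k2") && col("e_date").between(col("start_date"), col("end_date")),
    "left_outer")

renamed.printSchema()  // k1 and k2 instead of two "key" columns
```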

Solution #2: specify the columns you want to keep in the result:

SELECT a.*, b.val1, b.val2
FROM TAB_A a left outer join TAB_B b on a.key = b.key AND a.e_date between b.start_date and b.end_date 
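Since the question asks for DataFrames rather than SQL, here is a sketch of the same idea with select; the local session and sample rows are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("select-cols").getOrCreate()
import spark.implicits._

// Toy data following the question's column names.
val TAB_A = Seq(("a", "2020-01-15")).toDF("key", "e_date")
val TAB_B = Seq(("a", "2020-01-01", "2020-01-31", 1.0)).toDF("key", "start_date", "end_date", "VAL")

// Keep every TAB_A column plus only the TAB_B columns you need,
// so TAB_B's key never appears in the output.
val result = TAB_A.join(TAB_B,
    TAB_A("key") === TAB_B("key")
      && TAB_A("e_date").between(TAB_B("start_date"), TAB_B("end_date")),
    "left_outer")
  .select(TAB_A("*"), TAB_B("VAL"))

result.printSchema()  // key, e_date, VAL -- no duplicate key
```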


// Since you only want to keep one key column, start from the join you already have
val joined = TAB_A.join(TAB_B, TAB_A.col("key") === TAB_B.col("key")
                         && TAB_A.col("e_date").between(TAB_B.col("start_date"), TAB_B.col("end_date")),
                    "left_outer")
// then drop the duplicate key coming from TAB_B (or from TAB_A)
val result_df = joined.drop(TAB_B("key"))
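Putting the drop approach together as a runnable sketch (the sample data and local session are illustrative assumptions, not from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("drop-dup-key").getOrCreate()
import spark.implicits._

// Toy data: "a" falls inside TAB_B's date range, "b" has no match in TAB_B.
val TAB_A = Seq(("a", "2020-01-15"), ("b", "2020-02-10")).toDF("key", "e_date")
val TAB_B = Seq(("a", "2020-01-01", "2020-01-31", 1.0)).toDF("key", "start_date", "end_date", "VAL")

// Join on the full condition, then drop TAB_B's copy of the key column.
val result_df = TAB_A.join(TAB_B,
    TAB_A("key") === TAB_B("key")
      && TAB_A("e_date").between(TAB_B("start_date"), TAB_B("end_date")),
    "left_outer")
  .drop(TAB_B("key"))

// One "key" column remains; the unmatched left row ("b") keeps a null VAL.
result_df.show()
```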
