
Join two dataframes and replace the original column values using Spark Scala

I have two DFs

df1:

+---+-----+--------+
|key|price|    date|
+---+-----+--------+
|  1|  1.0|20210101|
|  2|  2.0|20210101|
|  3|  3.0|20210101|
+---+-----+--------+

df2:

+---+-----+
|key|price|
+---+-----+
|  1|  1.1|
|  2|  2.2|
|  3|  3.3|
+---+-----+

I'd like to replace price column values from df1 with price values from df2 where df1.key == df2.key

Expected output:

+---+-----+--------+
|key|price|    date|
+---+-----+--------+
|  1|  1.1|20210101|
|  2|  2.2|20210101|
|  3|  3.3|20210101|
+---+-----+--------+

I've found some solutions in Python, but I couldn't come up with a working solution in Scala.

Simply join the two DataFrames and drop df1's price column:

val df = df1.join(df2, Seq("key")).drop(df1("price"))

df.show
//+---+-----+--------+
//|key|price|    date|
//+---+-----+--------+
//|  1|  1.1|20210101|
//|  2|  2.2|20210101|
//|  3|  3.3|20210101|
//+---+-----+--------+

Or, if df1 has entries with no match in df2 and you want to keep their original price in that case, use a left join plus a coalesce expression:

import org.apache.spark.sql.functions.{col, coalesce}

val df = df1.join(df2, Seq("key"), "left").select(
  col("key"),
  coalesce(df2("price"), df1("price")).as("price"),
  col("date")
)
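To tie it together, here is a minimal, self-contained sketch of the left-join approach. The SparkSession setup, object name, and the literal data construction are illustrative assumptions, not part of the original question:

```scala
// Illustrative sketch: build the two example DataFrames and replace
// df1's price with df2's where keys match, keeping df1's price otherwise.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, coalesce}

object ReplacePriceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // local mode, for demonstration only
      .appName("replace-price")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, 1.0, "20210101"), (2, 2.0, "20210101"), (3, 3.0, "20210101"))
      .toDF("key", "price", "date")
    val df2 = Seq((1, 1.1), (2, 2.2), (3, 3.3)).toDF("key", "price")

    // Left join keeps every row of df1; coalesce prefers df2's price when present.
    val df = df1.join(df2, Seq("key"), "left").select(
      col("key"),
      coalesce(df2("price"), df1("price")).as("price"),
      col("date")
    )

    df.show()
    spark.stop()
  }
}
```

Referencing `df1("price")` and `df2("price")` after the join works because Spark resolves them through each DataFrame's lineage, which disambiguates the two identically named columns.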
