I have two Spark DataFrames where one of them has two cols, id and Tag. A second DataFrame has an id col, but missing the Tag. The first Dataframe is essentially a dictionary, each id appears once, while in the second DataFrame and id may appear several times. What I need is to create a new col in the second DataFrame that has the Tag as a function of the id in each row (in the second DataFrame). I think this can be done by converting to RDDs first ..etc, but I thought there must be a more elegant way using DataFrames (in Java). Example: given a df1 Row-> id: 0, Tag: "A" , a df2 Row1-> id: 0, Tag: null , a df2 Row2-> id: 0, Tag: "B" , I need to create a Tag col in the resulting DataFrame df3 equal to df1(id=0) = "A" IF df2 Tag was null, but keep original Tag if not null => resulting in df3 Row1-> id: 0, Tag: "A" , df3 Row2-> id: 0, Tag: "B" . Hope the example is clear.
| ID | No. | Tag | new Tag Col |
| 1 | 10002 | A | A |
| 2 | 10003 | B | B |
| 1 | 10004 | null | A |
| 2 | 10005 | null | B |
All you need here is left outer join and coalesce
:
import org.apache.spark.sql.functions.coalesce
val df = sc.parallelize(Seq(
(1, 10002, Some("A")), (2, 10003, Some("B")),
(1, 10004, None), (2, 10005, None)
)).toDF("id", "no", "tag")
val lookup = sc.parallelize(Seq(
(1, "A"), (2, "B")
)).toDF("id", "tag")
df.join(lookup, df.col("id").equalTo(lookup.col("id")), "leftouter")
.withColumn("new_tag", coalesce(df.col("tag"), lookup.col("tag")))
This should almost identical to Java version.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.