I have two dataframes like this:
DF1:
id | name
---
1 | abc
2 | xyz
DF2:
id | course
---
1 | c1
1 | c2
1 | c3
2 | c1
2 | c3
When I do a left_outer or inner join of df1 and df2, I want the resultant dataframe to look like:
id | name | course
---
1 | abc | c1
2 | xyz | c1
It doesn't matter whether it is c1, c2, or c3 for id 1 when I join; I just need one record per id.
Please let me know how I can achieve this in Spark.
Thanks, John
How about dropping duplicate records in df2 based on the id column, which keeps only one record per unique id, and then joining the result with df1:
df1.join(df2.dropDuplicates(Seq("id")), Seq("id"), "inner").show
+---+----+------+
| id|name|course|
+---+----+------+
| 1| abc| c1|
| 2| xyz| c1|
+---+----+------+
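Note that dropDuplicates keeps an arbitrary row per id, so the course you get back is not guaranteed. If you want a deterministic pick, one option is to aggregate df2 first, e.g. taking the minimum course per id. A hedged sketch (assumes df1 and df2 are as shown above, and uses the standard min aggregate from org.apache.spark.sql.functions):

```
import org.apache.spark.sql.functions.min

// Reduce df2 to one row per id, deterministically choosing
// the lexicographically smallest course for each id.
val oneCoursePerId = df2.groupBy("id").agg(min("course").as("course"))

// Join as before; each id now contributes exactly one row.
df1.join(oneCoursePerId, Seq("id"), "inner").show
// For the sample data this yields c1 for both ids, since
// c1 sorts before c2 and c3.
```

The trade-off: groupBy with an aggregate costs a shuffle just like dropDuplicates, but the result is stable across runs, which matters if downstream code or tests depend on which course is kept.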