
Joining two DataFrames in Spark to return only one match

I have two DataFrames like this:

DF1:

id | name
---|-----
1  | abc
2  | xyz

DF2:

id | course
---|-------
1  | c1
1  | c2
1  | c3
2  | c1
2  | c3

When I do a left_outer or inner join of df1 and df2, I want the resulting DataFrame to look like this:

id | name | course
---|------|-------
1  | abc  | c1
2  | xyz  | c1

It doesn't matter whether the course is c1, c2, or c3 for id 1 when I join; I just need one record per id.

Please let me know how I can achieve this in Spark.

Thanks, John
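
For reference, here is a minimal sketch of how the example DataFrames above could be created in the Scala Spark shell; the spark session (as provided by spark-shell) and the toDF column names are assumptions, not code from the original question:

// Assumes a SparkSession named `spark`, as provided by spark-shell
import spark.implicits._

// Recreate the example data shown above
val df1 = Seq((1, "abc"), (2, "xyz")).toDF("id", "name")
val df2 = Seq((1, "c1"), (1, "c2"), (1, "c3"), (2, "c1"), (2, "c3")).toDF("id", "course")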

How about dropping all duplicate records based on the id column, which keeps only one record per unique id, and then joining the result with df1:

df1.join(df2.dropDuplicates(Seq("id")), Seq("id"), "inner").show

+---+----+------+
| id|name|course|
+---+----+------+
|  1| abc|    c1|
|  2| xyz|    c1|
+---+----+------+
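
Note that dropDuplicates keeps an arbitrary row for each id, so which course survives is not guaranteed across runs. If you need a deterministic pick (for example, the alphabetically first course), a window function with row_number is one alternative; this is a sketch under that assumption, not part of the original answer:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank courses within each id (ordered alphabetically) and keep only the first
val w = Window.partitionBy("id").orderBy("course")
val firstCourse = df2
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

df1.join(firstCourse, Seq("id"), "inner").show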
