I have two dataframes like this:
DF1:
id | name
---
1 | abc
2 | xyz
DF2:
id | course
---
1 | c1
1 | c2
1 | c3
2 | c1
2 | c3
When I do a left_outer or inner join of df1 and df2, I want the resultant dataframe to look like:
id | name | course
---
1 | abc | c1
2 | xyz | c1
It doesn't matter whether it is c1, c2, or c3 for id 1 when I join; I just need one record per id.
Please let me know how I can achieve this in Spark.
Thanks, John
How about dropping duplicate records in df2 based on the id column, which keeps only one record per unique id, and then joining the result with df1:
df1.join(df2.dropDuplicates(Seq("id")), Seq("id"), "inner").show
+---+----+------+
| id|name|course|
+---+----+------+
| 1| abc| c1|
| 2| xyz| c1|
+---+----+------+
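Note that dropDuplicates keeps an arbitrary row per id, so the course you get back is not guaranteed. If you want a deterministic pick, one option is to aggregate df2 first, e.g. taking the minimum course per id. A hedged sketch (assumes df1 and df2 are as shown above, and uses the standard min aggregate from org.apache.spark.sql.functions):

```
import org.apache.spark.sql.functions.min

// Reduce df2 to one row per id, deterministically choosing
// the lexicographically smallest course for each id.
val oneCoursePerId = df2.groupBy("id").agg(min("course").as("course"))

// Join as before; each id now contributes exactly one row.
df1.join(oneCoursePerId, Seq("id"), "inner").show
// For the sample data this yields c1 for both ids, since
// c1 sorts before c2 and c3.
```

The trade-off: groupBy with an aggregate costs a shuffle just like dropDuplicates, but the result is stable across runs, which matters if downstream code or tests depend on which course is kept.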