
How can I extract the values that don't match when joining two RDDs in Spark?

I have two RDDs that look like this:

rdd1 = [(12, "abcd", "lmno"), (45, "wxyz", "rstw"), (67, "asdf", "wert")]
rdd2 = [(12, "abcd", "lmno"), (87, "whsh", "jnmk"), (45, "wxyz", "rstw")]

I need to create a new RDD containing all the values from rdd2 that have no corresponding match in rdd1. The resulting RDD should contain the following data:

rdd3 = [(87, whsh, jnmk)]

Does anyone know how to accomplish this?

You can do a full outer join and then split the result into two new RDDs:

  1. Records where both RDDs had a matching key
  2. Records where the rdd2 key is present but the rdd1 side is null (this is the set you want)

You'll first need to convert them to key-value RDDs (for example, keyed by the first field). Spark's fullOuterJoin yields (key, (Option[left], Option[right])) pairs, so you can filter for pairs where the left side is empty. Sample code below: rdd3 = rdd1.fullOuterJoin(rdd2).filter(x => x._2._1.isEmpty).map(x => (x._1, x._2._2.get))

(Yes, there is a more idiomatic way to unwrap the Option, e.g. pattern matching in the map, but this should work.)
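To make the anti-join logic concrete without needing a Spark cluster, here is a plain-Python sketch of the same idea: key each record by its first field, then keep only the records from rdd2 whose key never appears in rdd1. The variable names mirror the question's data; this is an illustration of the logic, not Spark API code.

```python
# Records from the question; keys are the first tuple element.
rdd1 = [(12, "abcd", "lmno"), (45, "wxyz", "rstw"), (67, "asdf", "wert")]
rdd2 = [(12, "abcd", "lmno"), (87, "whsh", "jnmk"), (45, "wxyz", "rstw")]

# Collect the set of keys present in rdd1 (the "left" side of the join).
rdd1_keys = {rec[0] for rec in rdd1}

# Anti-join: keep rdd2 records whose key has no match in rdd1.
rdd3 = [rec for rec in rdd2 if rec[0] not in rdd1_keys]

print(rdd3)  # [(87, 'whsh', 'jnmk')]
```

In Spark terms this is what the filter on the empty left Option achieves after the full outer join; `subtractByKey` on key-value RDDs expresses the same operation directly.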
