
How can I extract the values that don't match when joining two RDDs in Spark?

I have two RDDs that look like this:

rdd1 = [(12, "abcd", "lmno"), (45, "wxyz", "rstw"), (67, "asdf", "wert")]
rdd2 = [(12, "abcd", "lmno"), (87, "whsh", "jnmk"), (45, "wxyz", "rstw")]

I need to create a new RDD containing all the values from rdd2 that have no corresponding match in rdd1. The resulting RDD should contain the following data:

rdd3 = [(87, whsh, jnmk)]

Does anyone know how to accomplish this?

You can do a full outer join and then split the result into two new RDDs:

  1. Records where both RDDs had a matching key
  2. Records where the rdd2 key is present but the rdd1 side is null (this is the set you want)

You'll first need to convert them to key-value RDDs (for example, keyed by the first field). Spark's fullOuterJoin yields (key, (Option[left], Option[right])) pairs, so you can filter for pairs where the left side is empty. Sample code below: rdd3 = rdd1.fullOuterJoin(rdd2).filter(x => x._2._1.isEmpty).map(x => (x._1, x._2._2.get))

(Yes, there is a more idiomatic way to unwrap the Option, e.g. pattern matching in the map, but this should work.)
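To make the anti-join logic concrete without needing a Spark cluster, here is a plain-Python sketch of the same idea: key each record by its first field, then keep only the records from rdd2 whose key never appears in rdd1. The variable names mirror the question's data; this is an illustration of the logic, not Spark API code.

```python
# Records from the question; keys are the first tuple element.
rdd1 = [(12, "abcd", "lmno"), (45, "wxyz", "rstw"), (67, "asdf", "wert")]
rdd2 = [(12, "abcd", "lmno"), (87, "whsh", "jnmk"), (45, "wxyz", "rstw")]

# Collect the set of keys present in rdd1 (the "left" side of the join).
rdd1_keys = {rec[0] for rec in rdd1}

# Anti-join: keep rdd2 records whose key has no match in rdd1.
rdd3 = [rec for rec in rdd2 if rec[0] not in rdd1_keys]

print(rdd3)  # [(87, 'whsh', 'jnmk')]
```

In Spark terms this is what the filter on the empty left Option achieves after the full outer join; `subtractByKey` on key-value RDDs expresses the same operation directly.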
