
How to merge two RDDs in Spark if value stored at a key matches

Let's say I have two RDDs:

rdd1 = [ (key1, value1), (key2, value2), (key3, value3) ]

rdd2 = [ (key4, value4), (key5, value5), (key6, value6) ]

And I want to merge the RDDs if and only if the value stored at key1 in rdd1 == the value stored at key5 in rdd2.

How would I go about doing that in Spark using Java or Scala?

I think you are looking for a join.

The first thing you'd need to do is map them to PairRDDs, with key1, key2, etc. as keys. This example uses Tuple2 as input:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// The input here is already a Tuple2; the first element becomes the join key.
JavaPairRDD<Integer, String> pairRdd = rdd.mapToPair(new PairFunction<Tuple2<Integer, String>, Integer, String>() {
    public Tuple2<Integer, String> call(Tuple2<Integer, String> val) throws Exception {
        return new Tuple2<Integer, String>(val._1(), val._2());
    }
});

Once you've mapped both, you just need to join them by key:

JavaPairRDD<Integer, Tuple2<String, String>> combined = pairRdd.join(pairRdd2);

Then, combined will be something like:

[ (key1, (value1, value5)), (key2, (value2, value4)) ]

where key1 == key5 and key2 == key4. In other words, the entries you want merged must end up under the same key after the mapping step, since join pairs up entries by key.
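
For completeness, here is a minimal runnable sketch of the same flow in the Spark shell (Scala); the sample keys and values are made up for illustration:

val rdd1 = sc.parallelize(Seq((1, "value1"), (2, "value2"), (3, "value3")))
val rdd2 = sc.parallelize(Seq((2, "value4"), (1, "value5"), (6, "value6")))

// join keeps only keys present in both RDDs and pairs up their values
val combined = rdd1.join(rdd2)
combined.collect()
// Array((1,(value1,value5)), (2,(value2,value4)))  -- order may vary

Entries whose keys appear in only one RDD (3 and 6 above) are dropped, which is exactly the "if and only if" behaviour the question asks for.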

Here is a solution in Scala Spark:

scala> val rdd1 = sc.parallelize(List((3,"s"),(2,"df"),(1,"i")))
scala> val rdd2 = sc.parallelize(List((1,"ds"),(2,"h"),(1,"i")))
scala> val swaprdd1 = rdd1.map(_.swap)
scala> val swaprdd2 = rdd2.map(_.swap)
scala> val intersectrdd = swaprdd1.intersection(swaprdd2)
scala> val resultrdd = intersectrdd.map(_.swap)
scala> resultrdd.collect()
res0: Array[(Int, String)] = Array((1,i))
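
Note that intersection only keeps pairs that are identical in both RDDs (same key and same value), which is why only (1,"i") survives above. If you need to match on values alone, regardless of key, a join on the swapped RDDs is a closer fit; a minimal sketch reusing the data above:

scala> val joined = swaprdd1.join(swaprdd2)
scala> joined.collect()
res1: Array[(String, (Int, Int))] = Array((i,(1,1)))

Each result is (value, (keyFromRdd1, keyFromRdd2)), so you can also see which keys held the matching value.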

I hope it's helpful for your solution :)
