
How to merge two RDDs in Spark if value stored at a key matches

Let's say I have two RDDs:

rdd1 = [ (key1, value1), (key2, value2), (key3, value3) ]

rdd2 = [ (key4, value4), (key5, value5), (key6, value6) ]

And I want to merge the RDDs if and only if the value stored at key1 in rdd1 == the value stored at key5 in rdd2.

How would I go about doing that in Spark using Java or Scala?

I think you are looking for a join.

The first thing you'd need to do is map them to PairRDDs, with key1, key2, etc. as keys. This example uses Tuple2 as input:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// The input here is already a Tuple2; the first element becomes the join key.
JavaPairRDD<Integer, String> pairRdd = rdd.mapToPair(new PairFunction<Tuple2<Integer, String>, Integer, String>() {
    public Tuple2<Integer, String> call(Tuple2<Integer, String> val) throws Exception {
        return new Tuple2<Integer, String>(val._1(), val._2());
    }
});

Once you've mapped both, you just need to join them by key:

JavaPairRDD<Integer, Tuple2<String, String>> combined = pairRdd.join(pairRdd2);

Then, combined will be something like:

[ (key1, (value1, value5)), (key2, (value2, value4)) ]

where key1 == key5 and key2 == key4. In other words, the entries you want merged must end up under the same key after the mapping step, since join pairs up entries by key.
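
For completeness, here is a minimal runnable sketch of the same flow in the Spark shell (Scala); the sample keys and values are made up for illustration:

val rdd1 = sc.parallelize(Seq((1, "value1"), (2, "value2"), (3, "value3")))
val rdd2 = sc.parallelize(Seq((2, "value4"), (1, "value5"), (6, "value6")))

// join keeps only keys present in both RDDs and pairs up their values
val combined = rdd1.join(rdd2)
combined.collect()
// Array((1,(value1,value5)), (2,(value2,value4)))  -- order may vary

Entries whose keys appear in only one RDD (3 and 6 above) are dropped, which is exactly the "if and only if" behaviour the question asks for.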

Here is a solution in Scala Spark:

scala> val rdd1 = sc.parallelize(List((3,"s"),(2,"df"),(1,"i")))
scala> val rdd2 = sc.parallelize(List((1,"ds"),(2,"h"),(1,"i")))
scala> val swaprdd1 = rdd1.map(_.swap)
scala> val swaprdd2 = rdd2.map(_.swap)
scala> val intersectrdd = swaprdd1.intersection(swaprdd2)
scala> val resultrdd = intersectrdd.map(_.swap)
scala> resultrdd.collect()
res0: Array[(Int, String)] = Array((1,i))
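
Note that intersection only keeps pairs that are identical in both RDDs (same key and same value), which is why only (1,"i") survives above. If you need to match on values alone, regardless of key, a join on the swapped RDDs is a closer fit; a minimal sketch reusing the data above:

scala> val joined = swaprdd1.join(swaprdd2)
scala> joined.collect()
res1: Array[(String, (Int, Int))] = Array((i,(1,1)))

Each result is (value, (keyFromRdd1, keyFromRdd2)), so you can also see which keys held the matching value.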

I hope it's helpful for your solution :)
