Let's say I have two RDDs:
rdd1 = [ (key1, value1), (key2, value2), (key3, value3) ]
rdd2 = [ (key4, value4), (key5, value5), (key6, value6) ]
I want to merge the RDDs if and only if the value stored at key1 in rdd1 equals the value stored at key5 in rdd2.
How would I go about doing that in Spark using Java or Scala?
I think you are looking for a join.
The first thing you'd need to do is map them to PairRDDs, with key1, key2, etc. as keys. This example takes a Tuple2 as input:
JavaPairRDD<Integer, String> pairRdd = rdd.mapToPair(new PairFunction<Tuple2<Integer, String>, Integer, String>() {
    @Override
    public Tuple2<Integer, String> call(Tuple2<Integer, String> val) throws Exception {
        return new Tuple2<Integer, String>(val._1(), val._2());
    }
});
Once you have mapped both, you just need to join them by key:
JavaPairRDD<Integer, Tuple2<String, String>> combined = pairRdd.join(pairRdd2);
Then, combined will be something like:
[ (key1, (value1, value5)), (key2, (value2, value4)) ]
where key1 == key5 and key2 == key4.
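For readers without a Spark cluster handy, the join semantics can be sketched with plain Java collections. This is a simplified, hypothetical stand-in for `JavaPairRDD.join` (ignoring partitioning and shuffles), showing that an inner join keeps only keys present on both sides and pairs up their values:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class JoinSketch {
    // Plain-Java sketch of an inner join by key: for every key present in
    // both lists, pair each left value with each matching right value.
    static <K, V, W> List<Entry<K, Entry<V, W>>> join(
            List<Entry<K, V>> left, List<Entry<K, W>> right) {
        List<Entry<K, Entry<V, W>>> out = new ArrayList<>();
        for (Entry<K, V> l : left) {
            for (Entry<K, W> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    out.add(new SimpleEntry<>(l.getKey(),
                            new SimpleEntry<>(l.getValue(), r.getValue())));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry<Integer, String>> rdd1 = List.of(
                new SimpleEntry<>(1, "value1"), new SimpleEntry<>(2, "value2"));
        List<Entry<Integer, String>> rdd2 = List.of(
                new SimpleEntry<>(1, "value5"), new SimpleEntry<>(2, "value4"));
        // Keys 1 and 2 appear in both lists, so both rows survive the join.
        System.out.println(join(rdd1, rdd2));
    }
}
```

Spark's real `join` does the same matching per partition after a shuffle, so the output type `(K, (V, W))` lines up with the `JavaPairRDD<Integer, Tuple2<String, String>>` above.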
Here is a solution in Scala Spark:
scala> val rdd1 = sc.parallelize(List((3,"s"),(2,"df"),(1,"i")))
scala> val rdd2 = sc.parallelize(List((1,"ds"),(2,"h"),(1,"i")))
scala> val swaprdd1=rdd1.map(_.swap)
scala> val swaprdd2=rdd2.map(_.swap)
scala> val intersectrdd = swaprdd1.intersection(swaprdd2)
scala> val resultrdd = intersectrdd.map(_.swap)
I hope it's helpful for your solution. :)
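The swap/intersection idea can be checked without a Spark shell. Below is a minimal plain-Java sketch (the helper name `matchByValue` is made up, and `Set.retainAll` plays the role of `RDD.intersection`): swap each `(key, value)` to `(value, key)`, intersect, then swap back.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map.Entry;
import java.util.Set;

public class SwapIntersect {
    // Mirror of the REPL session: swap (key, value) -> (value, key),
    // intersect the swapped sets, then swap the survivors back.
    static <K, V> List<Entry<K, V>> matchByValue(
            List<Entry<K, V>> rdd1, List<Entry<K, V>> rdd2) {
        Set<Entry<V, K>> swap1 = new HashSet<>();
        for (Entry<K, V> e : rdd1) swap1.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        Set<Entry<V, K>> swap2 = new HashSet<>();
        for (Entry<K, V> e : rdd2) swap2.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        swap1.retainAll(swap2); // stands in for RDD.intersection
        List<Entry<K, V>> result = new ArrayList<>();
        for (Entry<V, K> e : swap1) result.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        return result;
    }

    public static void main(String[] args) {
        List<Entry<Integer, String>> rdd1 = List.of(
                new SimpleEntry<>(3, "s"), new SimpleEntry<>(2, "df"), new SimpleEntry<>(1, "i"));
        List<Entry<Integer, String>> rdd2 = List.of(
                new SimpleEntry<>(1, "ds"), new SimpleEntry<>(2, "h"), new SimpleEntry<>(1, "i"));
        // Only the pair (1, "i") appears in both inputs.
        System.out.println(matchByValue(rdd1, rdd2));
    }
}
```

Note that intersecting swapped pairs keeps only records where both the key and the value agree; if you need to match on the value alone regardless of key, a join on the swapped RDDs is the better fit.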