
Joining the columns of two RDDs in Apache Spark

This question has been asked before, but I could not understand the answers properly.

I have two RDDs with the same number of columns and the same number of records:

RDD1(col1,col2,col3)

and

RDD2(colA,colB,colC)

I need to join them as follows:

RDD_FINAL(col1,col2,col3,colA,colB,colC)

There is no key to join the records on, but they are in order: the first record of RDD1 corresponds to the first record of RDD2, and so on.

You can use the zipWithIndex method to add each row's index as a key to both RDDs, and then join them on that key.
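To make the idea concrete without a Spark cluster, here is a minimal plain-Java sketch of the same pattern: key each record by its position (what zipWithIndex does), match records on that key (what join does), and keep only the paired values. The class name `ZipJoinSketch` and the comma-joined string rows are illustrative stand-ins, not Spark API.

```java
import java.util.ArrayList;
import java.util.List;

public class ZipJoinSketch {
    // Pair up two equally sized, order-aligned lists by position,
    // mimicking zipWithIndex + join + drop-the-key on RDDs.
    static List<String> zipJoin(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < left.size(); i++) {
            // the index i plays the role of the zipWithIndex key
            out.add(left.get(i) + "," + right.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rdd1 = List.of("c1,c2,c3", "d1,d2,d3"); // stand-in for RDD1 rows
        List<String> rdd2 = List.of("cA,cB,cC", "dA,dB,dC"); // stand-in for RDD2 rows
        System.out.println(zipJoin(rdd1, rdd2));
        // prints [c1,c2,c3,cA,cB,cC, d1,d2,d3,dA,dB,dC]
    }
}
```

The difference in Spark is that the data is distributed, so the positional key has to be materialized explicitly with zipWithIndex before an actual join can match the two sides.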

Adding a code snippet for Alfilercio's example.

// Element types are placeholders; substitute your actual column types.
JavaRDD<Tuple3<String, String, String>> rdd1 = ... // (col1, col2, col3)
// zipWithIndex assigns consecutive positional indices; swap the pair so the index becomes the key.
// (zipWithUniqueId would NOT work here: its ids are unique but not guaranteed to line up across RDDs.)
JavaPairRDD<Long, Tuple3<String, String, String>> pairRdd1 =
    rdd1.zipWithIndex().mapToPair(pair -> new Tuple2<>(pair._2(), pair._1()));

JavaRDD<Tuple3<String, String, String>> rdd2 = ... // (colA, colB, colC)
JavaPairRDD<Long, Tuple3<String, String, String>> pairRdd2 =
    rdd2.zipWithIndex().mapToPair(pair -> new Tuple2<>(pair._2(), pair._1()));

// Join on the positional key, then drop the key, keeping the paired rows.
JavaRDD<Tuple2<Tuple3<String, String, String>, Tuple3<String, String, String>>> mappedRdd =
    pairRdd1.join(pairRdd2).map(pair -> pair._2());
