
How to combine two RDDs with different keys in Java Spark?

Suppose I have an RDD of Tuple2 like the one below:

<session1_w1, <0.2, 2>>, 
<session1_w2, <1.3, 4>>, 
<session1_w3, <0.4, 3>>, 
<session2_w1, <0.5, 2>>, 
<session2_w2, <2.3, 6>>

I need to map it to the following RDD, such that the last field is the sum of the last fields of all tuples that share the same partial key (e.g. session1):

2 + 4 + 3 => 9   
2 + 6 => 8

So the result that I expect is:

<session1_w1, 0.2, 9>, 
<session1_w2, 1.3, 9>, 
<session1_w3, 0.4, 9>, 
<session2_w1, 0.5, 8>, 
<session2_w2, 2.3, 8>

It is some kind of reduction, but I do not want to lose the original keys.

I can calculate the summation by mapping and then reducing to the following RDD, but then I need to merge this RDD with the first RDD to obtain the result.

<session1, 9> <session2, 8>

Any ideas?

You can use groupBy on the session part of the key, which keeps every original record inside its group. It does not preserve the order, though, so if you need the original ordering you must zipWithIndex first and sortBy the index afterwards.

For example, if you have an RDD[(String, (Double, Int))]:

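// Group by the session prefix, not the full key (assumes keys look like "session1_w1").
// This should give you an RDD[(Iterable[(String, Double)], Int)]
val group = myRDD.groupBy(_._1.split("_")(0)).map(x => (x._2.map(y => (y._1, y._2._1)),
                                                        x._2.map(y => y._2._2).reduce(_ + _)))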

// This will give you back an RDD[(Int, (String, Double))], i.e. (summed Int, (key, Double)), which you can then map.
val result = group.map(x => (x._2, x._1)).flatMapValues(x => x)
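Since the question asks about Java, here is a minimal plain-Java sketch of the same group, sum, and flatten data flow on a local list. It deliberately uses no Spark classes, so it only illustrates the logic, not the distributed API. The names GroupBySessionSketch, Row, sessionOf and withSessionTotals are made up for this illustration, and it assumes every key contains an underscore after the session name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupBySessionSketch {

    // One record: full key such as "session1_w1", a weight, and a count.
    static final class Row {
        final String key;
        final double weight;
        final int count;

        Row(String key, double weight, int count) {
            this.key = key;
            this.weight = weight;
            this.count = count;
        }

        @Override
        public String toString() {
            return "<" + key + ", " + weight + ", " + count + ">";
        }
    }

    // Assumption: the session is everything before the first underscore.
    static String sessionOf(String key) {
        return key.substring(0, key.indexOf('_'));
    }

    // Group rows by session, sum the counts per group, then flatten back to
    // one row per original key, with the count replaced by the group total.
    static List<Row> withSessionTotals(List<Row> rows) {
        Map<String, List<Row>> groups = new LinkedHashMap<>();
        for (Row r : rows) {
            groups.computeIfAbsent(sessionOf(r.key), s -> new ArrayList<>()).add(r);
        }
        List<Row> out = new ArrayList<>();
        for (List<Row> group : groups.values()) {
            int total = 0;
            for (Row r : group) {
                total += r.count;
            }
            for (Row r : group) {
                out.add(new Row(r.key, r.weight, total));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("session1_w1", 0.2, 2),
                new Row("session1_w2", 1.3, 4),
                new Row("session1_w3", 0.4, 3),
                new Row("session2_w1", 0.5, 2),
                new Row("session2_w2", 2.3, 6));
        for (Row r : withSessionTotals(rows)) {
            System.out.println(r);
        }
    }
}
```

The LinkedHashMap keeps groups in first-seen order here only because everything is local. A Spark groupBy gives no such guarantee, which is why the index trick mentioned above is needed there.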

You can also do a simple reduceByKey on the session (without the Double), and later join the result back to the original RDD so that the original Doubles are preserved.

==========EDIT============

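The second solution simply uses the RDD join. You have your original RDD in the format RDD[(String, (Double, Int))], and I presume you have already obtained an RDD[(String, Int)] where the String is the session and the Int is the sum. Both RDDs must be keyed by the session for the join to match anything, so the original RDD has to be re-keyed by its session prefix first. The join operation is then simply: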

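// Re-key the original by session (assumes keys look like "session1_w1"), then join with the sums.
RDDOriginal.map(x => (x._1.split("_")(0), (x._1, x._2._1))).join(RDDwithSum).map(x => (x._2._1._1, x._2._1._2, x._2._2)) // This should give you the original key (String) followed by the Double and the Int (the sum).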

The join method does not preserve the order either; if you want to keep the order, you will have to use zipWithIndex.
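The sum-then-join flow, including the index trick for restoring order, can be sketched in plain Java as follows. Again this is only a local illustration with no Spark classes. In the Spark Java API the steps would roughly correspond to mapToPair, reduceByKey, join and zipWithIndex, but that mapping is not shown or tested here. SumThenJoinSketch, Joined, sessionOf and joinWithTotals are hypothetical names, and keys are assumed to contain an underscore after the session name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SumThenJoinSketch {

    // A joined result row; index is the original position (the zipWithIndex analogue).
    static final class Joined {
        final long index;
        final String key;
        final double weight;
        final int total;

        Joined(long index, String key, double weight, int total) {
            this.index = index;
            this.key = key;
            this.weight = weight;
            this.total = total;
        }

        @Override
        public String toString() {
            return "<" + key + ", " + weight + ", " + total + ">";
        }
    }

    // Assumption: the session is everything before the first underscore.
    static String sessionOf(String key) {
        return key.substring(0, key.indexOf('_'));
    }

    static List<Joined> joinWithTotals(String[] keys, double[] weights, int[] counts) {
        // Step 1 (reduceByKey analogue): session -> sum of counts.
        Map<String, Integer> totals = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            totals.merge(sessionOf(keys[i]), counts[i], Integer::sum);
        }
        // Step 2 (join analogue): re-key each row by session and look up its total.
        // The loop runs backwards on purpose, to mimic a join that returns rows
        // in a different order than the input.
        List<Joined> joined = new ArrayList<>();
        for (int i = keys.length - 1; i >= 0; i--) {
            joined.add(new Joined(i, keys[i], weights[i], totals.get(sessionOf(keys[i]))));
        }
        // Step 3 (sortBy-index analogue): restore the original order.
        joined.sort(Comparator.comparingLong((Joined j) -> j.index));
        return joined;
    }

    public static void main(String[] args) {
        String[] keys = {"session1_w1", "session1_w2", "session1_w3", "session2_w1", "session2_w2"};
        double[] weights = {0.2, 1.3, 0.4, 0.5, 2.3};
        int[] counts = {2, 4, 3, 2, 6};
        for (Joined j : joinWithTotals(keys, weights, counts)) {
            System.out.println(j);
        }
    }
}
```

The point of the sketch is that the join key has to be the session, while the original key travels along as part of the value so that it can be restored after the join.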
