Adding vectors present in two different RDDs (Scala, Spark)

I have two RDDs with this structure:

org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)]

Each row of an RDD contains a Long index and an org.apache.spark.mllib.linalg.Vector. I want to add each component of a vector to the corresponding component of a vector in a row of the other RDD. Each vector of the first RDD should be added to each vector of the other RDD.

An example would look like this:

RDD1:

Array[(Long, org.apache.spark.mllib.linalg.Vector)] = 
      Array((0,[0.1,0.2]),(1,[0.3,0.4]))

RDD2:

Array[(Long, org.apache.spark.mllib.linalg.Vector)] = 
      Array((0,[0.3,0.8]),(1,[0.2,0.7]))

Result:

Array[(Long, org.apache.spark.mllib.linalg.Vector)] = 
Array((0,[0.4,1.0]),(0,[0.3,0.9]),(1,[0.6,1.2]),(1,[0.5,1.1]))

Please consider the same situation using List instead of Array.

Here is my solution:

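    val l1 = List((0, List(0.1, 0.2)), (1, List(0.3, 0.4)))
    val l2 = List((0, List(0.3, 0.8)), (1, List(0.2, 0.7)))
    // Pair the rows by position, then add the two lists element by element.
    val sums = (l1 zip l2).map { case (m, a) => (m._1, m._2.zip(a._2).map { case (x, y) => x + y }) }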
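Note that zip pairs the rows by position, so it yields one sum per row, while the expected result in the question has four rows (every vector of the first collection added to every vector of the second). A minimal sketch of that all-pairs variant on plain Scala lists follows; the names rows1, rows2, addLists and allSums are made up for illustration, and the index kept in each output row is assumed to come from the first list, as in the question's example.

```scala
// Sample data taken from the question (indices as Long, vectors as List[Double]).
val rows1 = List((0L, List(0.1, 0.2)), (1L, List(0.3, 0.4)))
val rows2 = List((0L, List(0.3, 0.8)), (1L, List(0.2, 0.7)))

// Element-wise sum of two lists; zip silently truncates to the shorter one,
// so the lists are assumed to have the same length.
def addLists(x: List[Double], y: List[Double]): List[Double] =
  x.zip(y).map { case (p, q) => p + q }

// Every row of rows1 combined with every row of rows2.
// The index of the output row is the index of the row from rows1.
val allSums: List[(Long, List[Double])] =
  for {
    (i, v1) <- rows1
    (_, v2) <- rows2
  } yield (i, addLists(v1, v2))
```

With the sample data this gives four rows with indices 0, 0, 1, 1, and sums that match the question's expected result up to floating-point rounding.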

Let's experiment with Array :)

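Instead of doing this in driver code, you can do it all with RDD transformations. This helps when the RDDs are large, because the data stays distributed across the executors instead of being collected to the driver. Keep in mind that join itself shuffles data unless both RDDs already share the same partitioner.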

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    val a: RDD[(Long, Vector)] = sc.parallelize(Array((0L, Vectors.dense(0.1, 0.2)), (1L, Vectors.dense(0.3, 0.4))))
    val b: RDD[(Long, Vector)] = sc.parallelize(Array((0L, Vectors.dense(0.3, 0.8)), (1L, Vectors.dense(0.2, 0.7))))

    // Pair up the vectors that share the same index.
    val ab = a join b

    // Add the paired vectors component by component (two components each, as in the example).
    val result = ab.map { case (i, (v1, v2)) => (i, Vectors.dense(v1(0) + v2(0), v1(1) + v2(1))) }
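The join above pairs vectors by index and hard-codes two components, so it returns one row per index rather than the four rows shown in the question. The following is a rough sketch, not a definitive implementation, of two variants that work for vectors of any length: one keeps the same-index behaviour of the join, the other uses RDD.cartesian to add every vector of the first RDD to every vector of the second. The helper names addVectors, sumByIndex and sumAllPairs are hypothetical, and dense output vectors are assumed to be acceptable.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helper: element-wise sum of two vectors of equal size.
// Both inputs are converted to dense arrays, so sparse inputs become dense.
def addVectors(v1: Vector, v2: Vector): Vector = {
  require(v1.size == v2.size, "vectors must have the same size")
  Vectors.dense(v1.toArray.zip(v2.toArray).map { case (x, y) => x + y })
}

// Same-index sums: one output row per index present in both RDDs.
def sumByIndex(a: RDD[(Long, Vector)], b: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
  a.join(b).mapValues { case (v1, v2) => addVectors(v1, v2) }

// All-pairs sums: every row of a combined with every row of b.
// The output row keeps the index of the row from a, as in the question's example.
def sumAllPairs(a: RDD[(Long, Vector)], b: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
  a.cartesian(b).map { case ((i, v1), (_, v2)) => (i, addVectors(v1, v2)) }
```

For the two sample RDDs, sumAllPairs(a, b).collect() should return four rows, while sumByIndex(a, b).collect() returns two. The cartesian variant produces as many rows as the product of the two RDD sizes, so it can become expensive for large inputs.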
