
Aggregate key-values in order in Spark (Scala)

I'm trying to implement a distributed singular value decomposition of a matrix A in Spark (Scala). I have managed to compute all the elements of the product At*A (where At is the transpose of A) as transformations on RDDs, and I have the result as an RDD of the form RDD[((Int, Int), Double)]:

Array(((0,0),66.0), ((0,2),90.0), ((1,0),78.0), ((1,2),108.0), ((2,1),108.0), ((0,1),78.0), ((1,1),93.0), ((2,2),126.0), ((2,0),90.0))

where the key (j,k) indicates the row and column of the matrix At*A where the value belongs. In the end I would like to have the rows as a dense matrix (but I'm open to other suggestions).

I tried to use aggregateByKey like this on the first part of the tuple (which indicates which row of the matrix the value belongs to):

aggregateByKey(new HashSet[Double])(_+_,_++_)

but then the elements do not end up in the right order within each row of the final matrix.
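
(A HashSet also throws away duplicate values, not just the insertion order, so it cannot really represent a matrix row. A quick plain-Scala illustration with made-up values:)

import scala.collection.immutable.HashSet

// A HashSet keeps neither insertion order nor duplicates, so a row whose
// entries arrive as 78.0, 66.0, 90.0, 78.0 comes back in an unspecified order
// and with the repeated 78.0 collapsed into a single element.
val row = HashSet[Double]() + 78.0 + 66.0 + 90.0 + 78.0
println(row.mkString(", "))  // only three elements remain, order unspecified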

Is there any good way to do this? I post the code below in case it is useful.

Thank you and kind regards.

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

import org.apache.spark.mllib.linalg.Vectors

//Example matrix A as an RDD of indexed rows (A = [[1,2,3],[4,5,6],[7,8,9]]).
val A = sc.parallelize(Array(
  IndexedRow(0, Vectors.dense(1.0, 2.0, 3.0)),
  IndexedRow(1, Vectors.dense(4.0, 5.0, 6.0)),
  IndexedRow(2, Vectors.dense(7.0, 8.0, 9.0))))

import scala.collection.mutable.ArrayBuffer


//Function that maps an indexed row (a_1,...,a_n) to the pairs ((j,k), a_j*a_k).
def f(v: IndexedRow): Array[((Int, Int), Double)] = {
  val keyvaluepairs = ArrayBuffer[((Int, Int), Double)]()
  for (j <- 0 until v.vector.size) {
    for (k <- 0 until v.vector.size) {
      keyvaluepairs.append(((j, k), v.vector(j) * v.vector(k)))
    }
  }
  keyvaluepairs.toArray
}

//Map A to a key-value RDD where key = (j,k) and value = a_ij*a_ik.
val keyvalRDD = A.flatMap(row => f(row))


//Sum up all key-value pairs that have the same key (j,k) (corresponds to the element of A.T*A on the j:th row and k:th column).
val keyvalSum = keyvalRDD.reduceByKey((x, y) => x + y)

val rowkeySum = keyvalSum.map(x=>(x._1._2,x._2))  // The keys are of the form (j,k); keep only the index that tells which row of the matrix the value belongs to.

import scala.collection.immutable.HashSet
val mergeRows = rowkeySum.aggregateByKey(new HashSet[Double])(_+_,_++_)

import breeze.linalg.{Vector,DenseMatrix}

//Throw away the keys, turn the rows into Arrays and collect them to the driver.
val Rows = mergeRows.map(x => x._2.toArray).collect()

val dm = DenseMatrix(Rows: _*)

Try building the matrix with a CoordinateMatrix:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import breeze.linalg.DenseMatrix

def calculate(sc: SparkContext) = {
  val matrix =
    sc.parallelize(Array(((0,0),66.0), ((0,2),90.0), ((1,0),78.0), ((1,2),108.0), ((2,1),108.0), ((0,1),78.0), ((1,1),93.0), ((2,2),126.0), ((2,0),90.0)))
      .map(el => MatrixEntry(el._1._1, el._1._2, el._2))

  var i = 0
  val mat = new CoordinateMatrix(matrix)

  val m = mat.numRows()
  val n = mat.numCols()
  val result = DenseMatrix.zeros[Double](m.toInt, n.toInt)

  // Fill the local Breeze matrix row by row from the collected rows.
  mat.toRowMatrix().rows.collect().foreach { vec =>
    vec.foreachActive { case (index, value) =>
      result(i, index) = value
    }
    i += 1
  }

  println("Result: " + result)
}
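
For reference, a minimal driver to run it locally (the master URL and application name below are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("AtA-example").setMaster("local[*]")
val sc = new SparkContext(conf)
calculate(sc)
sc.stop()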

The result:

66.0  78.0   90.0   
78.0  93.0   108.0  
90.0  108.0  126.0
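
If you would rather stay with your own key-value pipeline, here is a sketch that keeps the column index k through the aggregation and sorts each row before building the matrix (variable names taken from your code; it assumes the matrix is small enough to collect on the driver):

// Keep the column index k next to the value so each row can be sorted later.
val rowsWithCols = keyvalSum.map { case ((j, k), v) => (j, (k, v)) }

// Group the entries of each row, sort them by column index, drop the keys,
// and bring the rows back to the driver in row order.
val sortedRows = rowsWithCols
  .groupByKey()
  .map { case (j, cols) => (j, cols.toArray.sortBy(_._1).map(_._2)) }
  .sortByKey()
  .values
  .collect()

val dm = DenseMatrix(sortedRows: _*)

Note that toRowMatrix() drops the row indices, so the calculate version above relies on the order in which collect() returns the rows; mat.toIndexedRowMatrix() keeps the index on each IndexedRow if that ordering ever becomes a concern.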
