
Spark job slower on a cluster than standalone

I have this piece of code which works fine in standalone mode but runs slowly on a cluster of 4 slaves (8 cores, 30 GB memory each) on AWS.

For a file of 10,000 entries:
Standalone: 257 s
AWS (4 slaves): 369 s

    def tabHash(nb:Int, dim:Int) = {

        var tabHash0 = Array(Array(0.0)).tail

        for( ind <- 0 to nb-1) {
            var vechash1 = Array(0.0).tail
            for( ind <- 0 to dim-1) {
                val nG = Random.nextGaussian
                vechash1 = vechash1 :+ nG
            }
            tabHash0 = tabHash0 :+ vechash1
        }
        tabHash0
    }

    def hashmin3(x:Vector, w:Double, b:Double, tabHash1:Array[Array[Double]]) = {

        var tabHash0 = Array(0.0).tail
        val x1 = x.toArray
        for( ind <- 0 to tabHash1.size-1) {
            var sum = 0.0
            for( ind2 <- 0 to x1.size-1) {
                sum = sum + (x1(ind2)*tabHash1(ind)(ind2))
            }           
            tabHash0 =  tabHash0 :+  (sum+b)/w
        }
        tabHash0

    }

    def pow2(tab1:Array[Double], tab2:Array[Double]) = {

        var sum = 0.0
        for( ind <- 0 to tab1.size-1) {
            sum = sum - Math.pow(tab1(ind)-tab2(ind),2)
        }
        sum
    }


        val w = ww
        val b = Random.nextDouble * w
        val tabHash2 = tabHash(nbseg,dim)

        var rdd_0 = parsedData.map(x => (x.get_id,(x.get_vector,hashmin3(x.get_vector,w,b,tabHash2)))).cache

        var rdd_Yet = rdd_0

        for( ind <- 1 to maxIterForYstar  ) {

            var rdd_dist = rdd_Yet.cartesian(rdd_0).flatMap{ case (x,y) => Some((x._1,(y._2._1,pow2(x._2._2,y._2._2))))}//.coalesce(64)

            var rdd_knn = rdd_dist.topByKey(k)(Ordering[(Double)].on(x=>x._2))

            var rdd_bary = rdd_knn.map(x=> (x._1,Vectors.dense(bary(x._2,k))))

            rdd_Yet = rdd_bary.map(x=>(x._1,(x._2,hashmin3(x._2,w,b,tabHash2))))


        }

I tried to broadcast some variables

        val w = sc.broadcast(ww.toDouble)
        val b = sc.broadcast(Random.nextDouble * ww)
        val tabHash2 = sc.broadcast(tabHash(nbseg,dim))

Without any effect.

I know the problem is not the bary function, because I tried another version of this code without hashmin3; it works fine with 4 slaves and worse with 8 slaves, but that is a topic for another question.

This is badly written code, especially for distributed computation on large data. I can't immediately tell what the root of the problem is, but in any case you should rewrite this code.

  1. Array is a poor choice for general-purpose, shareable data. It is mutable and requires contiguous memory allocation (the latter can be a problem even when you have enough memory). Prefer Vector (or sometimes List). Never use arrays for this, really (see the sketch after this list).
  2. var vechash1 = Array(0.0).tail creates a collection with one element and then calls a method just to get an empty collection. If this happens rarely, don't worry about the performance, but it's ugly! Use var vechash1: Array[Double] = Array(), var vechash1: Vector[Double] = Vector(), or var vechash1 = Vector.empty[Double].
  3. def tabHash(nb:Int, dim:Int) = Always declare the return type of a function when it is not obvious. The power of Scala is its rich type system, and compile-time checks tell you what you actually get as a result, not what you imagine you get. This matters even more with huge data, because compile-time checks save time and money, and the code is also easier to read later. def tabHash(nb:Int, dim:Int): Vector[Vector[Double]] =
  4. def hashmin3(x: Vector, ... looks like a typo: it doesn't compile without a type parameter (unless Vector here is Spark's org.apache.spark.mllib.linalg.Vector).
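
As a minimal illustration of point 1 (the value names here are made up for the example): repeatedly appending with :+ copies the whole array on every step, while an immutable Vector can be built in one pass.

    import scala.util.Random

    // Growing an Array with :+ copies the entire array on every append,
    // so building n elements costs O(n^2) work overall.
    val grownArray: Array[Double] =
      (1 to 1000).foldLeft(Array.empty[Double])((acc, _) => acc :+ Random.nextGaussian)

    // Building the same data as a Vector in one pass avoids the repeated copying.
    val builtVector: Vector[Double] = Vector.fill(1000)(Random.nextGaussian)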

The first function, written more compactly:

def tabHash(nb:Int, dim:Int): Vector[Vector[Double]] = {
  (0 to nb-1).map {_ =>
    (0 to dim - 1).map(_ => Random.nextGaussian()).toVector
  }.toVector
}

The second function is just ((x * M) + scalar_b) / scalar_w. It may be more efficient to use a library that is specifically optimized for matrix operations.
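
For instance, Breeze (the linear-algebra library Spark itself uses internally) can do this as a single matrix-vector product. This is only a sketch: it assumes tabHash2, nbseg and dim are the values built earlier in the question (not the broadcast versions), and hashmin3Mat is a made-up name.

    import breeze.linalg.{DenseMatrix, DenseVector}
    import org.apache.spark.mllib.linalg.{Vector => SparkVector}

    // Build the nbseg x dim hash matrix once on the driver.
    // tabHash2.flatten is row-major data, which is column-major data for the
    // dim x nbseg matrix, so we create that and transpose it back.
    val hashMat: DenseMatrix[Double] =
      new DenseMatrix(dim, nbseg, tabHash2.flatten).t

    // hashmin3 as one matrix-vector product: (M * x + b) / w, element-wise.
    def hashmin3Mat(x: SparkVector, w: Double, b: Double): Array[Double] =
      (hashMat * DenseVector(x.toArray)).map(s => (s + b) / w).toArray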

Third (I suppose there is a mistake here with the sign of the computation, if you mean the squared error):

def pow2(tab1: Vector[Double], tab2: Vector[Double]): Double =
  tab1.zip(tab2).map { case (t1, t2) => Math.pow(t1 - t2, 2) }.sum
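
If the negation in the original was actually intentional (presumably so that topByKey, which keeps the largest values per key, ends up returning the k nearest neighbours), then negate the whole sum rather than subtracting term by term. A sketch under that assumption:

    // Assumption: the original `sum - Math.pow(...)` negates the squared distance
    // on purpose, so that topByKey (largest-first) yields the k nearest neighbours.
    def negSqDist(tab1: Vector[Double], tab2: Vector[Double]): Double =
      -(tab1.zip(tab2).map { case (t1, t2) => Math.pow(t1 - t2, 2) }.sum)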

var rdd_Yet = rdd_0 The cached RDD reference is overwritten inside the loop, so that caching is wasted storage.
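
If an RDD produced inside the loop really is worth caching, a common pattern (only a sketch with made-up names, not tied to this exact code) is to cache the new RDD, materialize it, and unpersist the previous one on each iteration:

    import org.apache.spark.rdd.RDD

    // Generic pattern: keep only the current iteration's RDD cached.
    // `step` stands for whatever transformation the loop body applies.
    def iterate[A](start: RDD[A], n: Int)(step: RDD[A] => RDD[A]): RDD[A] = {
      var current = start.cache()
      for (_ <- 1 to n) {
        val next = step(current).cache()
        next.count()              // materialize before dropping the old one
        current.unpersist()
        current = next
      }
      current
    }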

The last loop is hard to analyse; I think it should be simplified.
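
For example, destructuring the tuples with pattern matching instead of chained ._1/._2 already makes the loop easier to follow. This sketch keeps the original names and the original Array-based hashmin3/pow2 signatures:

    // Same logic as the original loop, with destructured key/value pairs.
    var rdd_Yet = rdd_0
    for (_ <- 1 to maxIterForYstar) {
      val rdd_dist = rdd_Yet.cartesian(rdd_0).map {
        case ((id, (_, hashA)), (_, (vecB, hashB))) => (id, (vecB, pow2(hashA, hashB)))
      }
      val rdd_knn  = rdd_dist.topByKey(k)(Ordering[Double].on(x => x._2))
      val rdd_bary = rdd_knn.map { case (id, neighbours) => (id, Vectors.dense(bary(neighbours, k))) }
      // cache/unpersist rdd_Yet here, per the pattern above, if it is reused elsewhere
      rdd_Yet = rdd_bary.map { case (id, vec) => (id, (vec, hashmin3(vec, w, b, tabHash2))) }
    }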
