Scala fast way to parallelize collection

Question

My code is equivalent to this:

def iterate(prev: Vector[Int], acc: Int): Vector[Int] = {
  val next = (for { i <- 1.to(1000000) }
    yield (prev(Random.nextInt(i))) ).toVector

  if (acc < 20) iterate(next, acc + 1)
  else next
}
iterate(1.to(1000000).toVector, 1)

For a large number of iterations, it does an operation on a collection, and yields the value. At the end of the iterations, it converts everything to a vector. Finally, it proceeds to the next recursive self-call, but it cannot proceed until it has all the iterations done. The number of the recursive self-calls is very small.

I want to paralellize this, so I tried to use .par on the 1.to(1000000) range. This used 8 processes instead of 1, and the result was only twice faster! .toParArray was only slightly faster than .par . I was told it could be much faster if I used something different, like maybe ThreadPool - this makes sense, because all of the time is spent in constructing next , and I assume that concatenating the outputs of different processes onto shared memory will not result in huge slowdowns, even for very large outputs (this is a key assumption and it might be wrong). How can I do it? If you provide code, paralellizing the code I gave will be sufficient.

Note that the code I gave is not my actual code. My actual code is much more long and complex (Held-Karp algorithm for TSP with constraints, BitSets and more stuff), and the only notable difference is that in my code, prev 's type is ParMap , instead of Vector .

Edit, extra information: the ParMap has 350k elements on the worst iteration at the biggest sample size I can handle, and otherwise it's typically 5k-200k (that varies on a log scale). If it inherently needs a lot of time to concatenate the results from the processes into one single process (I assume this is what's happening), then there is nothing much I can do, but I rather doubt this is the case.

Answer 1

Implemented few versions after the original, proposed in the question,

rec0 is the original with a for loop;
rec1 uses par.map instead of for loop;
rec2 follows rec1 yet it employs parallel collection ParArray for lazy builders (and fast access on bulk traversal operations) ;
rec3 is a non-idiomatic non-parallel version with mutable ArrayBuffer .

Thus

import scala.collection.mutable.ArrayBuffer
import scala.collection.parallel.mutable.ParArray
import scala.util.Random

// Original
def rec0() = {
  def iterate(prev: Vector[Int], acc: Int): Vector[Int] = {
    val next = (for { i <- 1.to(1000000) }
      yield (prev(Random.nextInt(i))) ).toVector

    if (acc < 20) iterate(next, acc + 1)
    else next
  }
  iterate(1.to(1000000).toVector, 1)
}

//  par map
def rec1() = {
  def iterate(prev: Vector[Int], acc: Int): Vector[Int] = {
    val next = (1 to 1000000).par.map { i => prev(Random.nextInt(i)) }.toVector

    if (acc < 20) iterate(next, acc + 1)
    else next
  }
  iterate(1.to(1000000).toVector, 1)
}

// ParArray par map
def rec2() = {
  def iterate(prev: ParArray[Int], acc: Int): ParArray[Int] = {
    val next = (1 to 1000000).par.map { i => prev(Random.nextInt(i)) }.toParArray

    if (acc < 20) iterate(next, acc + 1)
    else next
  }
  iterate((1 to 1000000).toParArray, 1).toVector
}

// Non-idiomatic non-parallel
def rec3() = {
  def iterate(prev: ArrayBuffer[Int], acc: Int): ArrayBuffer[Int] = {

    var next = ArrayBuffer.tabulate(1000000){i => i+1}
    var i = 0
    while (i < 1000000) {
      next(i) = prev(Random.nextInt(i+1))
      i = i + 1
    }

    if (acc < 20) iterate(next, acc + 1)
    else next
  }
  iterate(ArrayBuffer.tabulate(1000000){i => i+1}, 1).toVector
}

Then a little testing on averaging elapsed times,

def elapsed[A] (f: => A): Double = {
  val start = System.nanoTime()
  f
  val stop = System.nanoTime()
  (stop-start)*1e-6d
}

val times = 10
val e0 = (1 to times).map { i => elapsed(rec0) }.sum / times
val e1 = (1 to times).map { i => elapsed(rec1) }.sum / times
val e2 = (1 to times).map { i => elapsed(rec2) }.sum / times
val e3 = (1 to times).map { i => elapsed(rec3) }.sum / times

// time in ms.
e0: Double = 2782.341
e1: Double = 2454.828
e2: Double = 3455.976
e3: Double = 1275.876

shows that the non-idiomatic non-parallel version proves the fastest in average. Perhaps for larger input data, the parallel, idiomatic versions may be beneficial.

Scala fast way to parallelize collection

Question

1 answers

solution1
0 2014-04-04 06:43:04

Scala fast way to parallelize collection

Question

1 answers

solution1 0 2014-04-04 06:43:04

solution1
0 2014-04-04 06:43:04