
Issues Aggregating Spark Datasets in Scala

I am computing a series of Dataset aggregations using Scala's /: (foldLeft) operator. The code for the aggregations is listed below:

def execute1(
  xy: DATASET,
  f: Double => Double): Double = {

  println("PRINTING: The data points being evaluated: " + xy)
  println("PRINTING: Running execute1")

  var z = xy.filter { case (x, y) => abs(y) > EPS }

  var ret = -z./:(0.0) { case (s, (x, y)) =>
    var px = f(x)
    s + px * log(px / y)
  }

  ret
}

My issue occurs when I run this block for a list of separate functions, which are passed in as the f parameter. The list of functions is:

lazy val pdfs = Map[Int, Double => Double](
  1 -> betaScaled,
  2 -> gammaScaled,
  3 -> logNormal,
  4 -> uniform,
  5 -> chiSquaredScaled
)

The executor function that runs the aggregations through the list is:

def execute2(
  xy: DATASET,
  fs: Iterable[Double => Double]): Iterable[Double] = {
  fs.map(execute1(xy, _))
}

With the final execution block:

val kl_rdd = master_ds.mapPartitions((it: DATASET) => {
  val pdfsList = pdfs_broadcast.value.map(
    n => pdfs.get(n).get
  )

  execute2(it, pdfsList).iterator
})

The problem is that, while the aggregations do occur, the results all seem to end up in the first slot of the output array, when I would like the aggregation for each function to be reported separately. I ran tests to confirm that all five functions are actually being run, and that everything is being summed into the first slot.

The pre-divergence value: -4.999635700491883
The pre-divergence value: -0.0
The pre-divergence value: -0.0
The pre-divergence value: -0.0
The pre-divergence value: -0.0

This is one of the hardest problems I've ever run into, so any direction would be GREATLY appreciated. Will give credit where it's due. Thanks!

Spark's Dataset doesn't have foldLeft (aka /:): https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.Dataset. It also requires a type parameter (Dataset[T]), and its name is not all uppercase.

So I suppose your DATASET type is actually an iterator, which gets drained after the first run of execute1, so every subsequent execute1 call receives an empty iterator. In other words, it doesn't aggregate all the functions into one slot: it executes the first one and effectively ignores the others (you get -0.0 because you passed 0.0 as the initial value to foldLeft).

As you can see from the mapPartitions signature:

def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

it gives you an iterator (a mutable structure that can be traversed only once), so you should call it.toList to get an immutable structure (a List, potentially large but limited to one partition's data) that can be traversed repeatedly.
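A minimal sketch of that fix, assuming the names from the question (master_ds, pdfs_broadcast, pdfs, execute2), that the partition elements are (Double, Double) pairs, and that execute1/execute2 are changed to take a Seq[(Double, Double)] instead of the single-pass iterator:

// Sketch only: materialise the partition's iterator into a List once,
// then reuse that List for every density function in pdfsList.
val kl_rdd = master_ds.mapPartitions((it: Iterator[(Double, Double)]) => {
  val points = it.toList                                 // the iterator is traversed exactly once here

  val pdfsList = pdfs_broadcast.value.map(n => pdfs(n))  // look up each broadcast function id

  execute2(points, pdfsList).iterator                    // each execute1 now re-traverses the List
})

If master_ds is a Dataset rather than an RDD, you will also need the usual implicit Encoder (e.g. spark.implicits._) in scope for the output type.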

PS: if you want to really work with Spark's Dataset/RDD, use aggregate (RDD) or agg (Dataset). See also: foldLeft or foldRight equivalent in Spark?
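As an illustration, here is a hypothetical sketch of execute1 rewritten with RDD.aggregate, assuming xy is an RDD[(Double, Double)] and that f and EPS mean the same thing as in the question:

import org.apache.spark.rdd.RDD
import math.{abs, log}

// Hypothetical sketch: the same negated sum expressed with RDD.aggregate,
// so each function gets its own distributed aggregation instead of a
// shared, already-drained iterator.
def execute1Agg(xy: RDD[(Double, Double)], f: Double => Double): Double = {
  val z = xy.filter { case (_, y) => abs(y) > EPS }
  -z.aggregate(0.0)(
    (s, p)   => { val px = f(p._1); s + px * log(px / p._2) }, // fold within a partition
    (s1, s2) => s1 + s2                                        // combine per-partition sums
  )
}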


Explanation about iterators:

scala> val it = List(1,2,3).toIterator
it: Iterator[Int] = non-empty iterator

scala> it.toList //traverse iterator and accumulate its data into List
res0: List[Int] = List(1, 2, 3)

scala> it.toList //iterator is drained, so second call doesn't traverse anything
res1: List[Int] = List()
