How to parallelize a Spark Scala computation?

Question

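I have code that computes the within-set sum of squared errors after clustering; I took most of it from the Spark MLlib source code.

When I run the equivalent code through the Spark API, it runs as many different (distributed) jobs and succeeds. When I run my own code (which should do the same thing as the Spark code), I get a stack overflow error. Any ideas why?

Here is the code: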

import java.util.Arrays
        import org.apache.spark.mllib.linalg.{Vectors, Vector}
        import org.apache.spark.mllib.linalg._
        import org.apache.spark.mllib.linalg.distributed.RowMatrix
        import org.apache.spark.rdd.RDD
        import org.apache.spark.api.java.JavaRDD
        import breeze.linalg.{axpy => brzAxpy, inv, svd => brzSvd, DenseMatrix => BDM, DenseVector => BDV,
          MatrixSingularException, SparseVector => BSV, CSCMatrix => BSM, Matrix => BM}

        val EPSILON = {
            var eps = 1.0
            while ((1.0 + (eps / 2.0)) != 1.0) {
              eps /= 2.0
            }
            eps
          }

        def dot(x: Vector, y: Vector): Double = {
            require(x.size == y.size,
              "BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:" +
              " x.size = " + x.size + ", y.size = " + y.size)
            (x, y) match {
              case (dx: DenseVector, dy: DenseVector) =>
                dot(dx, dy)
              case (sx: SparseVector, dy: DenseVector) =>
                dot(sx, dy)
              case (dx: DenseVector, sy: SparseVector) =>
                dot(sy, dx)
              case (sx: SparseVector, sy: SparseVector) =>
                dot(sx, sy)
              case _ =>
                throw new IllegalArgumentException(s"dot doesn't support (${x.getClass}, ${y.getClass}).")
            }
         }

         def fastSquaredDistance(
              v1: Vector,
              norm1: Double,
              v2: Vector,
              norm2: Double,
              precision: Double = 1e-6): Double = {
            val n = v1.size
            require(v2.size == n)
            require(norm1 >= 0.0 && norm2 >= 0.0)
            val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
            val normDiff = norm1 - norm2
            var sqDist = 0.0
            /*
             * The relative error is
             * <pre>
             * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
             * </pre>
             * which is bounded by
             * <pre>
             * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
             * </pre>
             * The bound doesn't need the inner product, so we can use it as a sufficient condition to
             * check quickly whether the inner product approach is accurate.
             */
            val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
            if (precisionBound1 < precision) {
              sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
            } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
              val dotValue = dot(v1, v2)
              sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
              val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
                (sqDist + EPSILON)
              if (precisionBound2 > precision) {
                sqDist = Vectors.sqdist(v1, v2)
              }
            } else {
              sqDist = Vectors.sqdist(v1, v2)
            }
            sqDist
        }

        def findClosest(
              centers: TraversableOnce[Vector],
              point: Vector): (Int, Double) = {
            var bestDistance = Double.PositiveInfinity
            var bestIndex = 0
            var i = 0
            centers.foreach { center =>
              // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
              // distance computation.
              var lowerBoundOfSqDist = Vectors.norm(center, 2.0) - Vectors.norm(point, 2.0)
              lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
              if (lowerBoundOfSqDist < bestDistance) {
                val distance: Double = fastSquaredDistance(center, Vectors.norm(center, 2.0), point, Vectors.norm(point, 2.0))
                if (distance < bestDistance) {
                  bestDistance = distance
                  bestIndex = i
                }
              }
              i += 1
            }
            (bestIndex, bestDistance)
        }

         def pointCost(
              centers: TraversableOnce[Vector],
              point: Vector): Double =
            findClosest(centers, point)._2



        def clusterCentersIter: Iterable[Vector] =
            clusterCenters.map(p => p)


        def computeCostZep(indata: RDD[Vector]): Double = {
            val bcCenters = indata.context.broadcast(clusterCenters)
            indata.map(p => pointCost(bcCenters.value, p)).sum()
          }

        computeCostZep(projectedData)

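I believe I am using all the same parallelized jobs as Spark does, but it doesn't work for me. Any advice on distributing my code, or help seeing why the overflow happens in my code, would be very helpful.

Here are links to the Spark source that is very similar: KMeansModel and KMeans

Here is the code that runs fine: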

val clusters = KMeans.train(projectedData, numClusters, numIterations)

val clusterCenters = clusters.clusterCenters




// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(projectedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

Here is the error output:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 94.0 failed 4 times, most recent failure: Lost task 1.3 in stage 94.0 (TID 37663, ip-172-31-13-209.ec2.internal): java.lang.StackOverflowError
  at $iwC$$iwC$$iwC$$iwC$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$iwC$$$$$$c57ec8bf9b0d5f6161b97741d596ff0$$$$wC$$iwC$iwC$$iwC$$iwC$iwC$$iwC$iwC$iwC$$iwC$$iwC$$iwC$iwC$$iwC$$iwC$$iwC$$iwC.dot(:226)
  at $iwC$$iwC$iwC$$iwC$$iwC$iwC$iwC$$iwC$$iwC$iwC$$iwC$$$$$c57ec8bf9b0d5f6161b97741d596ff0$$$$wC$iwC$iwC$iwC$iwC$iwC$$iwC$iwC$iwC$iwC$$iwC$iwC$iwC$iwC$$iwC$iwC$$iwC.dot(:226)
  ...

And further down:

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
  at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1088)
  at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1088)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
  at org.apache.spark.rdd.RDD.fold(RDD.scala:1082)
  at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply$mcD$sp(DoubleRDDFunctions.scala:34)
  at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:34)
  at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:34)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
  at org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)

Answer 1

What is happening seems fairly simple: you are calling the dot method recursively here:

def dot(x: Vector, y: Vector): Double = {
        require(x.size == y.size,
          "BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:" +
          " x.size = " + x.size + ", y.size = " + y.size)
        (x, y) match {
          case (dx: DenseVector, dy: DenseVector) =>
            dot(dx, dy)
          case (sx: SparseVector, dy: DenseVector) =>
            dot(sx, dy)
          case (dx: DenseVector, sy: SparseVector) =>
            dot(sy, dx)
          case (sx: SparseVector, sy: SparseVector) =>
            dot(sx, sy)
          case _ =>
            throw new IllegalArgumentException(s"dot doesn't support (${x.getClass}, ${y.getClass}).")
        }
     }

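Each recursive call to dot is made with the same arguments as the call before it, so the recursion never terminates. In MLlib's BLAS object these calls dispatch to private overloads of dot that take the concrete vector types; those overloads were not copied into your code, so every call resolves back to this same dot(x: Vector, y: Vector).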
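To see the effect in isolation, here is a minimal sketch in plain Scala with no Spark dependency. `Vec`, `Dense`, and `RecursionDemo` are hypothetical stand-ins invented for this illustration, not MLlib classes. The method lives in a non-final class on purpose: as far as I can tell from the `$iwC` frames, the shell wraps your definitions in a class in a similar way, and in that position the compiler does not turn the self-call into a loop, so the stack grows until it overflows.

```scala
// Hypothetical stand-in types for illustration only; these are not MLlib classes.
sealed trait Vec { def size: Int }
final case class Dense(values: Array[Double]) extends Vec {
  def size: Int = values.length
}

// A non-final class with a public method, so the self-call below is a real
// call that consumes a stack frame (it is not rewritten into a loop).
class RecursionDemo {
  // Only one `dot` is defined, so `dot(dx, dy)` resolves to this same method.
  def dot(x: Vec, y: Vec): Double = (x, y) match {
    case (dx: Dense, dy: Dense) => dot(dx, dy) // same arguments, no base case
  }

  // Returns true if calling `dot` ends in a StackOverflowError.
  def overflows(): Boolean =
    try {
      dot(Dense(Array(1.0)), Dense(Array(2.0)))
      false
    } catch {
      case _: StackOverflowError => true
    }
}
```

Narrowing the static type from `Vec` to `Dense` does not help here, because no overload accepting `Dense` exists for the compiler to pick.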

The stack trace tells you this too. Note that the location is inside the dot method:

$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$c57ec8bf9b0d5f6161b97741d596ff0$$w$$iwC$$iwC$$iwC$iwC$iwC$$iwC$$iwC$iwC$iwC$$iwC$$iwC$iwC$iwC$$iwC$iwC$$iwC.dot(:226) at
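One possible way to break the cycle is to give each concrete case its own, differently named helper, so the dispatcher can never call itself by accident. The sketch below is an assumption about how that might look, not the MLlib implementation: `SafeDot` and the `dotDenseDense`, `dotSparseDense`, and `dotSparseSparse` helpers are hypothetical names, and the code relies only on the public `size`, `values`, and `indices` members of the MLlib vector classes. It also assumes sparse indices are strictly increasing, which is what MLlib expects of a `SparseVector`.

```scala
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

// Hypothetical helper object; the names below are illustrative, not MLlib API.
object SafeDot {
  // Dense x dense: plain element-wise multiply and accumulate.
  private def dotDenseDense(x: DenseVector, y: DenseVector): Double = {
    var sum = 0.0
    var i = 0
    while (i < x.size) { sum += x.values(i) * y.values(i); i += 1 }
    sum
  }

  // Sparse x dense: visit only the stored entries of the sparse vector.
  private def dotSparseDense(x: SparseVector, y: DenseVector): Double = {
    var sum = 0.0
    var k = 0
    while (k < x.indices.length) {
      sum += x.values(k) * y.values(x.indices(k))
      k += 1
    }
    sum
  }

  // Sparse x sparse: merge the two sorted index arrays.
  private def dotSparseSparse(x: SparseVector, y: SparseVector): Double = {
    var sum = 0.0
    var i = 0
    var j = 0
    while (i < x.indices.length && j < y.indices.length) {
      val xi = x.indices(i)
      val yj = y.indices(j)
      if (xi == yj) { sum += x.values(i) * y.values(j); i += 1; j += 1 }
      else if (xi < yj) i += 1
      else j += 1
    }
    sum
  }

  // Dispatcher: every branch calls a helper with a different name, so there is no self-call.
  def dot(x: Vector, y: Vector): Double = {
    require(x.size == y.size, s"size mismatch: x.size = ${x.size}, y.size = ${y.size}")
    (x, y) match {
      case (dx: DenseVector, dy: DenseVector)   => dotDenseDense(dx, dy)
      case (sx: SparseVector, dy: DenseVector)  => dotSparseDense(sx, dy)
      case (dx: DenseVector, sy: SparseVector)  => dotSparseDense(sy, dx)
      case (sx: SparseVector, sy: SparseVector) => dotSparseSparse(sx, sy)
      case _ =>
        throw new IllegalArgumentException(s"dot doesn't support (${x.getClass}, ${y.getClass}).")
    }
  }
}

// Example call: SafeDot.dot(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0))
```

Using distinct helper names rather than overloads is a deliberate choice here: if one helper is later deleted or fails to compile, the dispatcher stops compiling instead of silently resolving back to itself.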

Solution 1
4, accepted, 2016-05-29 18:40:28
