Spark: how to deal with lazy evaluation in the case of iterative (or recursive) function calls

Question

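I have a recursive function that needs to compare the result of the current call with the result of the previous call to decide whether it has converged. My function contains no action; it contains only map, flatMap, and reduceByKey. Because Spark does not evaluate transformations until an action is called, my next iteration does not get the proper value for the convergence check.

Here is a skeleton of the function: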

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def func1(sc: SparkContext, nodes: RDD[List[Long]], didConverge: Boolean, changeCount: Int): RDD[List[Long]] = {

   if (didConverge)
      nodes
   else {
       val currChangeCount = sc.accumulator(0, "xyz")
       val newNodes = performSomeOps(nodes, currChangeCount) // does a few map/flatMap/reduceByKey operations
       // No action has run at this point, so currChangeCount.value is still 0
       if (currChangeCount.value == changeCount)  {
          func1(sc, newNodes, true, currChangeCount.value)
       } else {
          func1(sc, newNodes, false, currChangeCount.value)
       }
   }
}

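performSomeOps contains only map, flatMap, and reduceByKey transformations. Since it has no action, the code inside performSomeOps is never executed, so currChangeCount never receives the actual count. That means the convergence condition (currChangeCount.value == changeCount) is meaningless. One workaround is to force evaluation in every iteration by calling count, but that is unnecessary overhead.

I would like to know how I can force an action without much overhead, or whether there is another way to solve this problem.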
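The laziness described in the question can be reproduced outside Spark. The following is a minimal sketch that uses java.util.stream as an analogy, not the Spark API: a counter updated inside a lazy map step stays at 0 until a terminal operation forces the pipeline to run. The class name LazyCounterDemo and the method countsBeforeAndAfter are hypothetical, and performSomeOps here is only a stand-in that borrows the name from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyCounterDemo {
    // Stand-in for the question's performSomeOps (an assumption, not Spark code):
    // builds a lazy pipeline whose map step bumps the counter as a side effect.
    static Stream<Long> performSomeOps(List<Long> nodes, AtomicInteger counter) {
        return nodes.stream().map(n -> {
            counter.incrementAndGet();
            return n + 1;
        });
    }

    // Returns the counter value observed before and after a terminal operation.
    static int[] countsBeforeAndAfter(List<Long> nodes) {
        AtomicInteger counter = new AtomicInteger(0);
        Stream<Long> pipeline = performSomeOps(nodes, counter);
        int before = counter.get();      // the map step has not run yet
        pipeline.forEach(n -> { });      // terminal operation forces evaluation
        int after = counter.get();
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] counts = countsBeforeAndAfter(Arrays.asList(1L, 2L, 3L));
        System.out.println(counts[0] + " -> " + counts[1]); // prints "0 -> 3"
    }
}
```

Reading the counter between building the pipeline and running it is the same mistake as reading currChangeCount.value before any action has executed.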

Answer 1

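I believe you are missing one very important thing here:

"For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware that each task's update may be applied more than once if tasks or job stages are re-executed."

Because of that, accumulators cannot be used reliably to manage control flow and are better suited to job monitoring.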
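To make the "applied more than once" caveat concrete, here is a small sketch in plain Java, again an analogy built on java.util.stream rather than Spark itself. A Supplier stands in for an uncached lineage that is recomputed on every evaluation, so a side-effect counter inside the map step grows with each run. The names RecomputeDemo and updatesAfterRuns are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Stream;

class RecomputeDemo {
    // Evaluates the same lazy pipeline `runs` times and returns the final counter value.
    static int updatesAfterRuns(List<Long> nodes, int runs) {
        AtomicInteger counter = new AtomicInteger(0);
        // Each call to get() rebuilds the pipeline, so the map step runs again,
        // much like a transformation that is re-executed instead of read from a cache.
        Supplier<Stream<Long>> lineage = () -> nodes.stream().map(n -> {
            counter.incrementAndGet();
            return n * 2;
        });
        for (int i = 0; i < runs; i++) {
            lineage.get().forEach(n -> { });
        }
        return counter.get();
    }

    public static void main(String[] args) {
        List<Long> nodes = Arrays.asList(1L, 2L, 3L);
        System.out.println(updatesAfterRuns(nodes, 1)); // prints 3
        System.out.println(updatesAfterRuns(nodes, 2)); // prints 6
    }
}
```

The counter reflects how many times the step ran, not how many elements exist, which is why a value gathered this way is a poor basis for a convergence decision.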

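Moreover, executing an action is not unnecessary overhead. If you want to know the result of a computation, you have to perform it, unless of course the result is trivial. Probably the cheapest possible action is: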

rdd.foreach { case _ =>  }

but it won't address the problem you have here.

In general, iterative computations in Spark can be structured as follows:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def func1(checkpointInterval: Int)(sc: SparkContext, nodes: RDD[List[Long]],
    didConverge: Boolean, changeCount: Int, iteration: Int): RDD[List[Long]] = {

  if (didConverge) nodes
  else {

    // Compute and cache new nodes (no accumulator is passed in;
    // the control value is derived from the result instead)
    val newNodes = performSomeOps(nodes).cache()

    // Periodically checkpoint to truncate the lineage and avoid a stack overflow.
    // Requires a checkpoint directory set with sc.setCheckpointDir.
    if (iteration % checkpointInterval == 0) newNodes.checkpoint()

    /* Call a function that computes the value which determines
     control flow. It executes an action on newNodes.
     computeChangeCount is left undefined here, like performSomeOps.
    */
    val newChangeCount = computeChangeCount(newNodes)

    // Unpersist old nodes
    nodes.unpersist()

    func1(checkpointInterval)(
      sc, newNodes, newChangeCount == changeCount,
      newChangeCount, iteration + 1
    )
  }
}

Answer 2

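I see that these map/flatMap/reduceByKey transformations are updating an accumulator. Therefore the only way to perform all the updates is to execute all of these functions, and count is the easiest way to achieve that; it gives the lowest overhead compared with the other options (cache + count, first, or collect).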

Answer 3

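The previous answers put me on the right track to solve a similar convergence detection problem.

foreach is described in the docs as:

"foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems."

It seems that, rather than using rdd.foreach() as a cheap action to trigger accumulator increments placed in various transformations, it should be used to do the incrementing itself.

I am unable to produce a Scala example, but here is a basic Java version, in case it still helps: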

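// Convergence is reached when two consecutive iterations
// return the same number of results.
// Assumes `accumulator` is a LongAccumulator created on the driver
// (e.g. with sc.longAccumulator()) and `rdd` is a JavaRDD.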
long previousCount = -1;
long currentCount = 0;

while (previousCount != currentCount){
    rdd = doSomethingThatUpdatesRdd(rdd);

    // Count entries in new rdd with foreach + accumulator
    rdd.foreach(tuple -> accumulator.add(1));

    // Update helper values
    previousCount = currentCount;
    currentCount = accumulator.sum();
    accumulator.reset();
}
// Convergence is reached

Answer 1: score 2, accepted, posted 2016-11-11 22:23:39

Answer 2: score 0, posted 2016-11-11 22:21:04

Answer 3: score 0, posted 2018-03-27 10:16:23
