Spark serialization error mystery

Let's say I have the following code:

class Context {
  def compute() = Array(1.0)
}
val ctx = new Context
val data = ctx.compute

Now we run this code in Spark:

val rdd = sc.parallelize(List(1,2,3))
rdd.map(_ + data(0)).count()

The code above throws org.apache.spark.SparkException: Task not serializable. I'm not asking how to fix it, by extending Serializable or making a case class; I want to understand why the error happens.

The thing I don't understand is why it complains about the Context class not being Serializable, even though it is not part of the lambda: rdd.map(_ + data(0)). data here is an Array of values which should be serialized, but it seems that the JVM captures the ctx reference as well, which, in my understanding, should not be happening.
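
The capture itself can be demonstrated without Spark. Below is a minimal sketch with illustrative names (Holder and CaptureDemo are not from the original code): a lambda that reads a field does so through this, so serializing the lambda drags the whole enclosing instance along, just like the REPL's wrapper object iw:

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Holder is not Serializable, playing the role of the REPL wrapper iw.
class Holder {
  val data: Array[Double] = Array(1.0)
  val f: Int => Double = x => x + data(0)  // reads a field, so it captures Holder.this
}

object CaptureDemo extends App {
  val out = new ObjectOutputStream(new ByteArrayOutputStream())
  try out.writeObject(new Holder().f)      // fails: Holder is reachable from f
  catch {
    case e: NotSerializableException => println(s"Not serializable: ${e.getMessage}")
  }
  finally out.close()
}

Running it reports Holder as the offending class, even though only data is actually used by the lambda body.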

As I understand it, in the shell Spark should clean the lambda of the REPL context. If we print the tree after the delambdafy phase, we see these pieces:

object iw extends Object {
  ... 
  private[this] val ctx: $line11.iw$Context = _;
  <stable> <accessor> def ctx(): $line11.iw$Context = iw.this.ctx;
  private[this] val data: Array[Double] = _;
  <stable> <accessor> def data(): Array[Double] = iw.this.data; 
  ...
}

class anonfun$1 ... {
  final def apply(x$1: Int): Double = anonfun$1.this.apply$mcDI$sp(x$1);
  <specialized> def apply$mcDI$sp(x$1: Int): Double = x$1.+(iw.this.data().apply(0));
  ...
}
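
For reference, trees like these can be printed from the REPL itself by enabling the compiler's phase-printing option. This works in the stock Scala REPL; spark-shell is built on top of it, so it should accept the same command, but treat that as an assumption:

scala> :settings -Xprint:delambdafy

Every line compiled after that is echoed as the tree produced after the delambdafy phase.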

So the decompiled lambda code that is sent to the worker node is: x$1.+(iw.this.data().apply(0)). The iw.this part belongs to the Spark shell session, so, as I understand it, it should be cleared by the ClosureCleaner, since it has nothing to do with the logic and shouldn't be serialized. Either way, calling iw.this.data() returns the Array[Double] value of the data variable, which is initialized in the constructor:

def <init>(): type = {
  iw.super.<init>();
  iw.this.ctx = new $line11.iw$Context();
  iw.this.data = iw.this.ctx().compute(); // <== here
  iw.this.res4 = ...
  ()
}

In my understanding, the ctx value has nothing to do with the lambda; it is not part of the closure, hence it shouldn't be serialized. What am I missing or misunderstanding?

This has to do with what Spark considers it can safely use as a closure. In some cases this is very unintuitive, since Spark uses reflection and in many cases can't recognize some of Scala's guarantees (it's not a full compiler or anything), or the fact that some variables in the same object are irrelevant. For safety, Spark will attempt to serialize any objects referenced, which in your case includes iw, which is not serializable.
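
One way to observe this without launching a job is to hand the closure to Spark's closure serializer yourself. SparkEnv is a developer-facing API, so treat this as a diagnostic sketch rather than stable usage (run inside spark-shell, with data defined as above):

import org.apache.spark.SparkEnv

// Diagnostic sketch: serialize the closure the same way task submission would.
// The thrown NotSerializableException names the REPL wrapper (iw).
val ser = SparkEnv.get.closureSerializer.newInstance()
ser.serialize(() => data(0))  // fails for the same reason rdd.map(_ + data(0)) does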

The code inside ClosureCleaner has a good example:

For instance, transitive cleaning is necessary in the following scenario:

class SomethingNotSerializable {
  def someValue = 1
  def scope(name: String)(body: => Unit) = body
  def someMethod(): Unit = scope("one") {
    def x = someValue
    def y = 2
    scope("two") { println(y + 1) }
  }
}

In this example, scope "two" is not serializable because it references scope "one", which references SomethingNotSerializable. Note, however, that the body of scope "two" does not actually depend on SomethingNotSerializable. This means we can safely null out the parent pointer of a cloned scope "one" and set it as the parent of scope "two", such that scope "two" no longer references SomethingNotSerializable transitively.
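
The mechanics behind "nulling out the parent pointer" can be sketched with plain reflection. This is only an illustration of the idea for Scala 2.11-style anonymous function classes, not Spark's actual implementation, which additionally clones the objects involved instead of mutating them in place:

// Illustration only: Scala stores a closure's enclosing-instance reference
// in a synthetic field named $outer. If the body never uses it, the field
// can be nulled so serialization stops walking the object graph there.
def nullOuter(closure: AnyRef): Unit =
  closure.getClass.getDeclaredFields
    .filter(_.getName == "$outer")
    .foreach { f => f.setAccessible(true); f.set(closure, null) }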

Probably the easiest fix is to create a local variable, in the same scope, that extracts the value from your object, so that there is no longer any reference to the encapsulating object inside the lambda:

val rdd = sc.parallelize(List(1,2,3))
val data0 = data(0)
rdd.map(_ + data0).count()
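
The same pattern applies inside methods of non-serializable classes. A sketch with illustrative names (Pipeline is made up for this example; Context is the class from the question):

import org.apache.spark.SparkContext

class Pipeline(sc: SparkContext) {  // Pipeline itself never needs to be serializable
  val data: Array[Double] = new Context().compute()

  def run(): Long = {
    val localData = data  // local copy: the closure captures only this val, not this
    sc.parallelize(List(1, 2, 3)).map(_ + localData(0)).count()
  }
}

Because localData is a local val of run, the closure ships only the array itself and never references the enclosing Pipeline instance.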
