Why does this Spark code work in local mode but not in cluster mode?
So, I have something like this. Note that baseTrait (a trait) here is serializable, and therefore thisClass (an object) should also be serializable.
object thisClass extends baseTrait {
  private var someVar: String = null

  def someFunc: RDD[...] = {
    ...
    // assigned some string value or an empty string value (not null anymore)
    someVar = ...
    ...
    if (someVar != "")
      someRDD.filter(x => aFunc(x, someVar))
    else
      ...
  }
}
In cluster mode, when I call the someFunc function (which is effectively a static method, since thisClass is an object), I get a NullPointerException, which I think has to do with someVar not being serialized properly. Because when I do this instead, it works perfectly in cluster mode:
if (someVar != "") {
  val someVar_ = someVar
  someRDD.filter(x => aFunc(x, someVar_))
}
Any idea what was going wrong in the original code, given that thisClass is serializable in the first place?
My guess is that it's fine to use a variable of a serializable class from within another class, but if you try to do it inside that same class, you can have problems, because in that case the runtime would be trying to serialize the very class the closure is being called from. What do you think?
You are not experiencing a serialization problem in this case.
Basically, what happens in cluster mode is that thisClass.someFunc is never actually executed in the remote executor's JVM. On the executor, thisClass is instantiated, and someVar is assigned null. Then, while the thisClass object is in that state, the Spark framework executes your lambda function directly on the records available in that executor's partition of the data.
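The difference between the two versions of the filter can be sketched without Spark at all: a closure that names an object field re-reads that field every time it is called, while a closure built over a local copy keeps the value it captured. ThisClassDemo and the string values below are hypothetical stand-ins for thisClass and someVar:

```scala
// Hypothetical stand-in for thisClass: a singleton with a mutable field.
object ThisClassDemo {
  var someVar: String = null // what a freshly instantiated executor JVM sees

  // Reads the object field by reference, at call time.
  def fieldClosure: String => Boolean = x => x.contains(someVar)

  // Captures a local copy of the field's current value.
  def copyClosure: String => Boolean = {
    val someVar_ = someVar
    x => x.contains(someVar_)
  }
}

// On the "driver", someVar is assigned before the closures are built.
ThisClassDemo.someVar = "spark"
val byField = ThisClassDemo.fieldClosure
val byCopy  = ThisClassDemo.copyClosure

// Mimic a fresh executor JVM, where someFunc never ran and the field is null.
ThisClassDemo.someVar = null

val copied = byCopy("a spark job") // true: the captured copy still holds "spark"
val crashed =
  try { byField("a spark job"); false }
  catch { case _: NullPointerException => true } // re-reads the now-null field
```

This is exactly the shape of the working workaround above: `val someVar_ = someVar` freezes the value into the closure instead of leaving a live reference to the object field.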
A way to avoid this is to move the assignment of someVar into the body of the thisClass object. Doing that will assign the value immediately when the object is instantiated. Bear in mind that this code will then be executed on every executor in the cluster.
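A minimal sketch of that shape; the object name and the initializer expression here are made-up placeholders, not the original code:

```scala
// Hypothetical sketch: the assignment lives in the object body, so it runs
// during object construction, i.e. on first access in every JVM --
// driver and executors alike -- before any lambda can touch the field.
object ThisClassInit {
  private val someVar: String =
    sys.env.getOrElse("SOME_VAR", "") // placeholder initializer

  def currentVar: String = someVar
}

val v = ThisClassInit.currentVar // never null: set at construction time
```

Because the field is now a val assigned at construction, a closure that references it can no longer observe the uninitialized null state, regardless of which JVM evaluates it.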
If that is not possible, another option would be to map your RDD[T] to RDD[(T, String)], where the string is the value of someVar for every record; then your filter could be something like .filter(x => aFunc(x._1, x._2)). This method will use more memory, as you'll have many copies of someVar's value.
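A sketch of that pairing, using a plain Seq to stand in for the RDD (map and filter have the same shape on both) and a substring check as a hypothetical stand-in for aFunc:

```scala
// Seq stands in for RDD here; the map/filter shapes are identical in Spark.
val someVar = "needle"
val aFunc = (x: String, v: String) => x.contains(v) // hypothetical predicate

val someRDD: Seq[String] = Seq("a needle here", "plain hay")

// Pair every record with someVar's value, so the filter closure only
// references its own arguments, never the enclosing object's field.
val paired: Seq[(String, String)] = someRDD.map(x => (x, someVar))
val kept: Seq[String] = paired.filter(x => aFunc(x._1, x._2)).map(_._1)
```

The memory cost mentioned above is visible here: every element of `paired` carries its own reference to the string, which on a real cluster means the value is shipped with each record rather than once per task.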