
implicit val serialization when using global object in spark-shell

It's not clear to me why the (non-serializable) implicit val gets serialized (an exception is thrown) here:

implicit val sc2:SparkContext = sc
val s1 = "asdf"
sc.parallelize(Array(1,2,3)).map(x1 => s1.map(x => 4))

but not when s1's value appears directly inside the closure:

implicit val sc2:SparkContext = sc
sc.parallelize(Array(1,2,3)).map(x1 => "asdf".map(x => 4))

My use case is obviously more complicated, but I've boiled it down to this issue.

(The solution is to define the implicit val as @transient.)
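The effect of `@transient` can be demonstrated without Spark at all, since Spark's default closure serializer is built on plain JVM serialization. The sketch below uses a hypothetical `FakeContext` class standing in for `SparkContext`: without `@transient` the field is written along with the instance and serialization fails; with `@transient` the field is skipped entirely.

```scala
import java.io._

// Stand-in for SparkContext: a class that is NOT Serializable.
class FakeContext

class WithoutTransient extends Serializable {
  implicit val sc2: FakeContext = new FakeContext // serialized with the instance
  val s1 = "asdf"
}

class WithTransient extends Serializable {
  @transient implicit val sc2: FakeContext = new FakeContext // skipped by serialization
  val s1 = "asdf"
}

object TransientDemo {
  // Returns true if the object survives plain Java serialization.
  def canSerialize(o: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(o)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(canSerialize(new WithoutTransient)) // false: sc2 is dragged along
    println(canSerialize(new WithTransient))    // true: @transient excludes sc2
  }
}
```

Note that a `@transient` field is simply omitted from the serialized form, so after deserialization on a worker it would be null; that's fine here because the workers never need the SparkContext itself.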

That depends on the scope where these lines reside:

Let's look at three options: in a method, in a class without s1, and in a class with s1:

object TTT {

  val sc = new SparkContext("local", "test")

  def main(args: Array[String]): Unit = {
    new A().foo()  // works
    new B          // works
    new C          // fails
  }

  class A {
    def foo(): Unit = {
      // no problem here: vars in a method can be serialized on their own
      implicit val sc2: SparkContext = sc
      val s1 = "asdf"
      sc.parallelize(Array(1, 2, 3)).map(x1 => s1.map(x => 4)).count()
      println("in A - works!")
    }
  }

  class B {
    // no problem here: B isn't serialized at all because there are no references to its members
    implicit val sc2: SparkContext = sc
    sc.parallelize(Array(1, 2, 3)).map(x1 => "asdf".map(x => 4)).count()
    println("in B - works!")
  }

  class C extends Serializable {
    implicit val sc2: SparkContext = sc
    val s1 = "asdf" // to serialize s1, Spark will try to serialize the enclosing C instance, which drags sc2 along with it
    sc.parallelize(Array(1, 2, 3)).map(x1 => s1.map(x => 4)).count() // fails
  }

}

Bottom line: implicit or not, this fails if and only if s1 and sc2 are members of a class, which means the class itself has to be serialized, dragging both of them along with it.
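The "dragging" described above is just how JVM lambdas capture their enclosing instance, so it can be reproduced without Spark. In this sketch (class names are hypothetical), a lambda that reads a class member captures `this`, so serializing the lambda must serialize the whole instance, including its non-serializable field; copying the member into a local val first (the standard Spark workaround) makes the lambda capture only the String.

```scala
import java.io._

// Stand-in for SparkContext: NOT Serializable.
class Unserializable

class Holder extends Serializable {
  val sc2 = new Unserializable
  val s1  = "asdf"

  // Referencing the member s1 forces the lambda to capture `this`,
  // so serializing the closure drags in the whole Holder, sc2 included.
  val viaMember: Int => String = x => s1

  // Copying s1 into a local first means only that String is captured.
  val viaLocal: Int => String = { val local = s1; x => local }
}

object ClosureDemo {
  // Returns true if the object survives plain Java serialization.
  def canSerialize(o: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(o)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val h = new Holder
    println(canSerialize(h.viaMember)) // false: captures `this`, hence sc2
    println(canSerialize(h.viaLocal))  // true: captures only a String
  }
}
```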

The scope is the spark-shell REPL. In this case, sc2 (and any other implicit val defined in the top-level REPL scope) is only serialized when it is implicit AND another val from that scope is used in the RDD operation. This makes sense because implicit values need to be available globally and hence are automatically serialized to all worker nodes.
