Implicit val serialization when using a global object in spark-shell

It's not clear to me why the (non-serializable) implicit val gets serialized (exception thrown) here:

implicit val sc2:SparkContext = sc
val s1 = "asdf"
sc.parallelize(Array(1,2,3)).map(x1 => s1.map(x => 4))

but not when s1's value is in the scope of the closure:

implicit val sc2:SparkContext = sc
sc.parallelize(Array(1,2,3)).map(x1 => "asdf".map(x => 4))

My use case is obviously more complicated but I've boiled it down to this issue.

(The solution is to define the implicit val as @transient)

That depends on the scope where these lines reside:

Let's have a look at three options: in a method, in a class without s1, and in a class with s1:

import org.apache.spark.SparkContext

object TTT {

  val sc = new SparkContext("local", "test")

  def main(args: Array[String]): Unit = {
    new A().foo()  // works
    new B          // works
    new C          // fails
  }

  class A {
    def foo(): Unit = {
      // no problem here: local vals in a method are captured individually, so only s1 is serialized
      implicit val sc2: SparkContext = sc
      val s1 = "asdf"
      sc.parallelize(Array(1, 2, 3)).map(x1 => s1.map(x => 4)).count()
      println("in A - works!")
    }
  }

  class B {
    // no problem here: B isn't serialized at all because there are no references to its members
    implicit val sc2: SparkContext = sc
    sc.parallelize(Array(1, 2, 3)).map(x1 => "asdf".map(x => 4)).count()
    println("in B - works!")
  }

  class C extends Serializable {
    implicit val sc2: SparkContext = sc
    val s1 = "asdf" // to serialize s1, Spark will try serializing the C instance, which will serialize sc2
    sc.parallelize(Array(1, 2, 3)).map(x1 => s1.map(x => 4)).count() // fails
  }

}

Bottom line: implicit or not, this fails when s1 and sc2 are members of the same class and the closure references s1. The closure then captures the enclosing instance, so the whole instance has to be serialized, and it "drags" both members along with it.
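
The capture behaviour described above can be reproduced without Spark, using plain Java serialization. The following is only a sketch under assumptions: FakeContext is a hypothetical stand-in for SparkContext (any non-serializable class would do), and Holder, capturesThis, capturesLocal and serializes are names made up for this illustration. It assumes Scala 2.12 or later, where function literals are serializable.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for SparkContext: deliberately NOT Serializable.
class FakeContext

object ClosureCaptureDemo {

  // Returns true if Java serialization of `obj` succeeds,
  // false if it hits a non-serializable object on the way.
  def serializes(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  class Holder extends Serializable {
    implicit val ctx: FakeContext = new FakeContext // non-serializable member
    val s1 = "asdf"

    // Reads the member s1, i.e. this.s1, so the closure captures `this`
    // and serializing it drags ctx along.
    def capturesThis: Int => Int = x => x + s1.length

    // Copies s1 into a local val first, so the closure captures only the String.
    def capturesLocal: Int => Int = {
      val local = s1
      x => x + local.length
    }
  }

  def main(args: Array[String]): Unit = {
    val h = new Holder
    println(serializes(h.capturesThis))  // expected: false
    println(serializes(h.capturesLocal)) // expected: true
  }
}
```

Copying the member into a local val before building the closure is the same trick class A relies on: the closure then closes over a single value instead of the enclosing instance.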

The scope is the spark-shell REPL. In this case, sc2 (and any other implicit val defined in the top-level REPL scope) is only serialized when it is implicit AND another val from that scope is used in the RDD operation. This presumably happens because implicit values need to be made available globally and hence are serialized to all worker nodes.
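
The @transient fix mentioned in the question can also be sketched outside Spark. This assumes that top-level REPL vals end up as members of a generated wrapper object, which is how I understand spark-shell to behave; the classes below only imitate such a wrapper. StubContext, PlainWrapper, TransientWrapper and serializes are hypothetical names for this illustration, not Spark or REPL APIs.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for SparkContext: deliberately NOT Serializable.
class StubContext

// Imitates a wrapper holding both vals as ordinary members.
class PlainWrapper extends Serializable {
  implicit val ctx: StubContext = new StubContext
  val s1 = "asdf"
}

// Same members, but the context field is marked @transient,
// so Java serialization skips it.
class TransientWrapper extends Serializable {
  @transient implicit val ctx: StubContext = new StubContext
  val s1 = "asdf"
}

object TransientDemo {

  // Returns true if Java serialization of `obj` succeeds.
  def serializes(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  def main(args: Array[String]): Unit = {
    println(serializes(new PlainWrapper))     // expected: false
    println(serializes(new TransientWrapper)) // expected: true
  }
}
```

Note that a @transient field is null after deserialization, so this only helps when the closure never actually uses the context on the workers.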
