Spark Object (singleton) serialization on executors

I am not sure that what I want to achieve is possible. What I do know is that I am accessing a singleton object from an executor to ensure its constructor is called only once on each executor. This pattern is already proven and works as expected for similar use cases in my code base.

However, what I would like to know is whether I can ship the object after it has been initialized on the driver. In this scenario, when accessing ExecutorAccessedObject.y, ideally it would not call the println but just return the value. This is a highly simplified version; in reality, I would like to make a call to some external system on the driver, so that when the value is accessed on the executor, it does not call that external system again. I am OK with @transient lazy val x being reinitialized once on the executors, as it will hold a connection pool, which cannot be serialized.

object ExecutorAccessedObject extends Serializable {
  @transient lazy val x: Int = {
    println("OK with initializing this on the executor, e.g. a database connection pool")
    1
  }

  val y: Int = {
    // call some external system to return a value.
    // I do not want to call the external system from the executor
    println(
      """
        |Ideally, this would not be printed on the executor.
        |return value 1 without re-initializing
      """.stripMargin)
    1
  }
  println("The constructor will run once on each executor")
}


someRdd.mapPartitions { part =>
  ExecutorAccessedObject
  ExecutorAccessedObject.x // first time accessed should re-evaluate
  ExecutorAccessedObject.y // ideally, never re-evaluate and return 1
  part
}

I attempted to solve this with broadcast variables as well, but I am unsure how to access the broadcast variable within the singleton object.

"What I would like to know is if I can ship the object after it has been initialized on the driver."

You cannot. Objects, as singletons, are never shipped to executors. They are initialized locally, whenever the object is accessed for the first time.
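
To illustrate the first-access initialization described above, here is a minimal Spark-free sketch. The names InitCounter, Demo and LazyInitDemo are hypothetical and exist only for this illustration. Each executor is a separate JVM, so the same initialization happens independently on each of them.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical counter, used only to observe how often the body of Demo runs.
object InitCounter {
  val count = new AtomicInteger(0)
}

object Demo {
  // The body of a Scala object runs when the object is first referenced
  // in a JVM, and it runs only once in that JVM.
  InitCounter.count.incrementAndGet()
  val y: Int = 1
}

object LazyInitDemo {
  def main(args: Array[String]): Unit = {
    val first = Demo.y   // first reference: the body of Demo runs here
    val second = Demo.y  // already initialized: the body does not run again
    println(s"y = $first, $second; initializations = ${InitCounter.count.get()}")
  }
}
```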

If the result of the call is serializable, just pass that value on its own, either as an argument to the ExecutorAccessedObject (implicitly or explicitly) or by making ExecutorAccessedObject mutable (and adding the required synchronization).
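
A minimal sketch of the mutable variant, under some assumptions: the init method, the yValue field, the InitSketch wrapper and callExternalSystem are hypothetical names that do not appear in the question, and the block is kept Spark-free so it can run on its own. The idea is that the driver computes the value once, the closure carries that plain value to the executors, and the object on each executor stores it the first time init is called.

```scala
object ExecutorAccessedObject {
  // Value computed on the driver; None until init is called in this JVM.
  @volatile private var yValue: Option[Int] = None

  // Stores the value on the first call in this JVM; later calls are ignored.
  def init(value: Int): Unit = synchronized {
    if (yValue.isEmpty) yValue = Some(value)
  }

  // Returns the stored value without contacting the external system.
  def y: Int = yValue.getOrElse(
    throw new IllegalStateException("ExecutorAccessedObject.init was not called in this JVM"))

  // Still initialized locally on first access, e.g. a connection pool.
  lazy val x: Int = 1
}

object InitSketch {
  // Hypothetical stand-in for the call to the external system (driver only).
  def callExternalSystem(): Int = 1

  def main(args: Array[String]): Unit = {
    val yOnDriver = callExternalSystem()
    // In Spark, the lines below would sit inside
    //   someRdd.mapPartitions { part => ...; part }
    // so that the closure captures yOnDriver (a plain Int) and ships it.
    ExecutorAccessedObject.init(yOnDriver)
    ExecutorAccessedObject.init(42) // ignored: a value is already stored
    println(ExecutorAccessedObject.y)
  }
}
```

For a larger serializable result, the same pattern should work with a broadcast variable: create it on the driver with sc.broadcast(result) and pass its .value to init inside mapPartitions.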
