
How to perform one operation on each executor once in Spark

I have a Weka model stored in S3 which is around 400MB in size. Now I have a set of records on which I want to run the model and perform prediction.

To perform prediction, here is what I have tried:

  1. Download and load the model on the driver as a static object, broadcast it to all executors, and perform a map operation on the prediction RDD. ----> Not working: to perform prediction in Weka, the model object needs to be modified, while a broadcast variable is a read-only copy.

  2. Download and load the model on the driver as a static object and send it to the executors in each map operation. -----> Working (but not efficient, since I am passing the 400MB object in each map operation)

  3. Download the model on the driver, load it on each executor, and cache it there. (I don't know how to do that)

Does someone have any idea how I can load the model on each executor once and cache it, so that it is not loaded again for other records?

You have two options:

1. Create a singleton object with a lazy val representing the data:

    object WekaModel {
        lazy val data = {
            // initialize data here. This will only happen once per JVM process
        }
    }       

Then, you can use the lazy val in your map function. The lazy val ensures that each worker JVM initializes its own instance of the data. No serialization or broadcasting is performed for data.

    elementsRDD.map { element =>
        // use WekaModel.data here
    }
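As a self-contained sketch of that singleton pattern (the loadFromS3 helper and the Array[Byte] payload are placeholders for your actual S3 download and Weka deserialization, e.g. via weka.core.SerializationHelper, not a real API):

```scala
object WekaModel {
  // `lazy val` initializes at most once per JVM, on first access;
  // Scala makes that initialization thread-safe across task threads.
  lazy val data: Array[Byte] = loadFromS3()

  // Placeholder for the real download + deserialization of the 400MB model.
  private def loadFromS3(): Array[Byte] = Array.fill(4)(0: Byte)
}
```

Every task running in the same executor JVM then sees the same cached instance of WekaModel.data.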

Advantages

  • More efficient, as it allows you to initialize your data once per JVM instance. This approach is a good choice when you need to initialize a database connection pool, for example.

Disadvantages

  • Less control over initialization. For example, it is trickier to initialize your object if you require runtime parameters.
  • You can't really free up or release the object if you need to. Usually that's acceptable, since the OS will free the resources when the process exits.

2. Use the mapPartitions (or foreachPartition) method on the RDD instead of just map.

This allows you to initialize whatever you need once for the entire partition.

    elementsRDD.mapPartitions { elements =>
        val model = new WekaModel()

        elements.map { element =>
            // use model and element. there is a single instance of model per partition.
        }
    }

Advantages

  • Provides more flexibility in the initialization and deinitialization of objects.

Disadvantages

  • Each partition will create and initialize a new instance of your object. Depending on how many partitions you have per JVM instance, this may or may not be an issue.

Here's what worked for me even better than the lazy initializer. I created an object-level pointer initialized to null and let each executor initialize it. In the initialization block you can have run-once code. Note that each processing batch will reset local variables but not the object-level ones.

object Thing1 {
  var bigObject : BigObject = null

  def main(args: Array[String]) : Unit = {
    val sc = <spark/scala magic here>
    sc.textFile(infile).map(line => {
      if (bigObject == null) {
         // this takes a minute but runs just once
         bigObject = new BigObject(parameters)  
      }
      bigObject.transform(line)
    })
  }
}

This approach creates exactly one big object per executor, rather than one big object per partition as in the other approaches.

If you put var bigObject : BigObject = null within the main function namespace, it behaves differently. In that case, it runs the bigObject constructor at the beginning of each partition (i.e. batch). If you have a memory leak, this will eventually kill the executor. Garbage collection would also need to do more work.

Here is what we usually do:

  1. Define a singleton client that does this kind of work, to ensure only one client is present in each executor.

  2. Have a getOrCreate method to create or fetch the client. This is useful when you have a common serving platform that needs to serve multiple different models; a ConcurrentMap with computeIfAbsent ensures the lookup is thread-safe.

  3. Call the getOrCreate method inside an RDD-level operation such as transform or foreachPartition, so that initialization happens at the executor level.
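The steps above can be sketched as follows; ModelClient and ClientRegistry are hypothetical names for illustration, not a real library API:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical client wrapper; in practice this would hold the loaded model.
final case class ModelClient(name: String)

object ClientRegistry {
  // One registry per executor JVM (singleton object).
  private val clients = new ConcurrentHashMap[String, ModelClient]()

  // computeIfAbsent runs the factory at most once per key, even when
  // several task threads in the same executor JVM race to get the client.
  def getOrCreate(name: String): ModelClient =
    clients.computeIfAbsent(name, n => ModelClient(n))
}
```

Calling ClientRegistry.getOrCreate from inside mapPartitions or foreachPartition then reuses the same client for every task that runs in that executor.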

You can achieve this by broadcasting a case object with a lazy val, as follows:

    case object localSlowTwo {lazy val value: Int = {Thread.sleep(1000); 2}}
    val broadcastSlowTwo = sc.broadcast(localSlowTwo)
    (1 to 1000).toDS.repartition(100).map(_ * broadcastSlowTwo.value.value).collect

The event timeline for this on three executors with three threads each looks as follows:

[Event timeline for stage 1]

Running the last line again from the same spark-shell session does not initialize anything anymore:

[Event timeline for stage 3]
