Programmatically add/remove executors to a Spark Session

I'm looking for a reliable way in Spark (v2+) to programmatically adjust the number of executors in a session.

I know about dynamic allocation and the ability to configure Spark executors when a session is created (e.g. with --num-executors), but neither of these options is very useful to me because of the nature of my Spark job.
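For reference, by configuring executors on session creation I mean something along these lines; the resource values and app name here are just placeholders, not recommendations:

import org.apache.spark.sql.SparkSession

// Roughly equivalent to `spark-submit --num-executors 20 --executor-cores 4 --executor-memory 8g ...`;
// all values are illustrative.
val spark = SparkSession.builder()
  .appName("aggregate-then-load-to-es")
  .config("spark.executor.instances", "20")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "8g")
  .getOrCreate()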

My spark job

The job performs the following steps on a large amount of data:

  1. Perform some aggregations / checks on the data
  2. Load the data into Elasticsearch (the ES cluster is typically much smaller than the Spark cluster)

The problem

  • If I use the full set of available Spark resources, I will very quickly overload Elasticsearch and potentially even knock over the Elasticsearch nodes.
  • If I use a small enough number of Spark executors so as not to overwhelm Elasticsearch, step 1 takes a lot longer than it needs to (because it only has a small % of the available Spark resources).

I appreciate that I could split this job into two jobs which are executed separately with different Spark resource profiles, but what I really want is to programmatically set the number of executors to X at a particular point in my Spark script (before the Elasticsearch load begins). This seems like a useful thing to be able to do generally.

My initial attempt

I played around a bit with changing settings and found something that sort of works, but it feels like a hacky way of doing something that should be doable in a more standardised and supported way.

My attempt (this is just me playing around):

import scala.util.Random

// BlockManagerIds of all currently registered executors (excluding the driver).
def getExecutors = spark.sparkContext.getExecutorStorageStatus.toSeq.map(_.blockManagerId).collect {
  case bm if !bm.isDriver => bm
}

def reduceExecutors(totalNumber: Int): Unit = {
  //TODO throw error if totalNumber is more than current
  logger.info(s"""Attempting to reduce number of executors to $totalNumber""")
  // Lower the target first so the cluster manager doesn't immediately replace killed executors.
  spark.sparkContext.requestTotalExecutors(totalNumber, 0, Map.empty)
  val killedExecutors = scala.collection.mutable.ListBuffer[String]()
  while (getExecutors.size > totalNumber) {
    // Pick random executors that we haven't already asked to be killed.
    val executorIds = getExecutors.map(_.executorId).filterNot(killedExecutors.contains(_))
    val executorsToKill = Random.shuffle(executorIds).take(executorIds.size - totalNumber)
    spark.sparkContext.killExecutors(executorsToKill)
    killedExecutors ++= executorsToKill
    Thread.sleep(1000) // poll until the executors have actually gone away
  }
}

def increaseExecutors(totalNumber: Int): Unit = {
  //TODO throw error if totalNumber is less than current
  logger.info(s"""Attempting to increase number of executors to $totalNumber""")
  spark.sparkContext.requestTotalExecutors(totalNumber, 0, Map.empty)
  while (getExecutors.size < totalNumber) {
    Thread.sleep(1000) // poll until the new executors have registered
  }
}
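For illustration, this is roughly how I call these helpers around the Elasticsearch load; the input path, column names and executor counts below are made up:

import org.apache.spark.sql.functions._

// Step 1: heavy aggregation with the full set of executors (input path and columns are placeholders).
val df = spark.read.parquet("/path/to/input")
val aggregated = df.groupBy("someKey").agg(count("*").as("n")).cache()
aggregated.count()   // materialise the aggregation before shrinking the cluster

// Shrink to something the Elasticsearch cluster can cope with before step 2.
reduceExecutors(4)

// Step 2: the Elasticsearch load goes here.

// Optionally scale back up for any remaining work.
increaseExecutors(20)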

One thing you can try is to call

val dfForES = df.coalesce(numberOfParallelElasticSearchUploads) 

before step #2. This would reduce the number of partitions without any shuffling overhead and ensure that at most numberOfParallelElasticSearchUploads executors are sending data to ES in parallel while the rest of them sit idle.
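Putting that together, a sketch of what the write could look like with the elasticsearch-hadoop connector (this assumes the connector is on the classpath; the hosts, index name and parallelism are illustrative only):

// Sketch only: hosts, index name and parallelism are placeholders.
val numberOfParallelElasticSearchUploads = 8

df.coalesce(numberOfParallelElasticSearchUploads)   // narrows partitions without a shuffle
  .write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-node-1,es-node-2")
  .option("es.port", "9200")
  .mode("append")
  .save("my-index/doc")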

If you're running your job on a shared cluster, I'd still recommend enabling dynamic allocation to release these idle executors for better resource utilization.
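The settings involved are roughly the following; the key names are standard Spark configuration, but the bounds and timeout are only examples to tune for your cluster:

import org.apache.spark.sql.SparkSession

// Illustrative dynamic-allocation settings.
val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "0")            // allow idle executors to be released
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")   // how long an executor may sit idle
  .config("spark.shuffle.service.enabled", "true")                // keep shuffle files available after executors are removed
  .getOrCreate()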

I was looking for a way to programmatically adjust the number of executors in pyspark and this was the top result. Here is what I've gathered from Will's question and from poking around with py4j:

# Create the spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(... your configs ...).getOrCreate()
sc = spark.sparkContext  # needed below for access to the JVM gateway

# Increase the cluster to 5 executors (the last argument is the host-to-task-count map):
spark._jsparkSession.sparkContext().requestTotalExecutors(5, 0, sc._jvm.PythonUtils.toScalaMap({}))

# Decrease the cluster back to zero executors:
spark._jsparkSession.sparkContext().requestTotalExecutors(0, 0, sc._jvm.PythonUtils.toScalaMap({}))

# Then explicitly kill the executors that are still registered:
javaExecutorIds = spark._jsparkSession.sparkContext().getExecutorIds()
executorIds = [javaExecutorIds.apply(i) for i in range(javaExecutorIds.length())]
print(f'Killing executors {executorIds}')
spark._jsparkSession.sparkContext().killExecutors(javaExecutorIds)

I hope that saves someone else from excessive googling.
