Programmatically add/remove executors to a Spark Session
I'm looking for a reliable way in Spark (v2+) to programmatically adjust the number of executors in a session.
I know about dynamic allocation and the ability to configure executors when a session is created (e.g. with --num-executors), but neither of these options is very useful to me because of the nature of my Spark job.
The job performs the following steps on a large amount of data:
I appreciate that I could split this job into two jobs which are executed separately with different Spark resource profiles, but what I really want is to programmatically set the number of executors to X at a particular point in my Spark script (before the Elasticsearch load begins). This seems like a useful thing to be able to do in general.
I played around a bit with changing settings and found something that sort of works, but it feels like a hacky way of doing something that should be doable in a more standardised and supported way.
My attempt (this is just me playing around):
import scala.util.Random

// Lists the block managers of all registered non-driver executors.
// (getExecutorStorageStatus is available on SparkContext in Spark 2.x;
// `spark` is the SparkSession and `logger` an application logger.)
def getExecutors = spark.sparkContext.getExecutorStorageStatus.toSeq.map(_.blockManagerId).collect {
  case bm if !bm.isDriver => bm
}

def reduceExecutors(totalNumber: Int): Unit = {
  // TODO throw error if totalNumber is more than current
  logger.info(s"""Attempting to reduce number of executors to $totalNumber""")
  spark.sparkContext.requestTotalExecutors(totalNumber, 0, Map.empty)
  val killedExecutors = scala.collection.mutable.ListBuffer[String]()
  while (getExecutors.size > totalNumber) {
    val executorIds = getExecutors.map(_.executorId).filterNot(killedExecutors.contains(_))
    val executorsToKill = Random.shuffle(executorIds).take(executorIds.size - totalNumber)
    spark.sparkContext.killExecutors(executorsToKill)
    killedExecutors ++= executorsToKill
    Thread.sleep(1000)
  }
}

def increaseExecutors(totalNumber: Int): Unit = {
  // TODO throw error if totalNumber is less than current
  logger.info(s"""Attempting to increase number of executors to $totalNumber""")
  spark.sparkContext.requestTotalExecutors(totalNumber, 0, Map.empty)
  while (getExecutors.size < totalNumber) {
    Thread.sleep(1000)
  }
}
One thing you can try is to call
val dfForES = df.coalesce(numberOfParallelElasticSearchUploads)
before step #2. This would reduce the number of partitions without shuffle overhead and ensure that at most numberOfParallelElasticSearchUploads executors are sending data to ES in parallel, while the rest of them sit idle.
If you're running your job on a shared cluster, I'd still recommend enabling dynamic allocation to release these idle executors for better resource utilisation.
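For reference, dynamic allocation is switched on through configuration rather than code. A typical spark-submit invocation might look like the sketch below; the bounds and timeout values are illustrative and `your_job.py` is a placeholder, and on YARN the external shuffle service must also be enabled:

```shell
# Illustrative spark-submit flags; tune the bounds and timeout for your job.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.shuffle.service.enabled=true \
  your_job.py
```

With executorIdleTimeout set, executors left idle after the coalesce are handed back to the cluster manager automatically.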
I was looking for a way to programmatically adjust the number of executors in pyspark and this was the top result. Here is what I've gathered from Will's question and from poking around with py4j:
# Create the spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(... your configs ...).getOrCreate()
sc = spark.sparkContext  # needed below for access to the py4j gateway

# Increase cluster to 5 executors:
spark._jsparkSession.sparkContext().requestTotalExecutors(5, 0, sc._jvm.PythonUtils.toScalaMap({}))

# Decrease cluster back to zero executors:
spark._jsparkSession.sparkContext().requestTotalExecutors(0, 0, sc._jvm.PythonUtils.toScalaMap({}))

# Kill the executors that are still registered:
javaExecutorIds = spark._jsparkSession.sparkContext().getExecutorIds()
executorIds = [javaExecutorIds.apply(i) for i in range(javaExecutorIds.length())]
print(f'Killing executors {executorIds}')
spark._jsparkSession.sparkContext().killExecutors(javaExecutorIds)
I hope that saves someone else from excessive googling.
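A small wrapper can tidy up the py4j plumbing above. In the sketch below, set_total_executors is a hypothetical helper mirroring the calls shown in this answer (it is not part of any Spark API), and java_string_seq_to_list isolates the Seq-to-list conversion so that part can be exercised without a running cluster:

```python
def java_string_seq_to_list(java_seq):
    """Convert a py4j-wrapped Scala Seq[String] (anything exposing
    apply(i) and length()) into a plain Python list."""
    return [java_seq.apply(i) for i in range(java_seq.length())]

def set_total_executors(spark, total):
    """Hypothetical helper wrapping the requestTotalExecutors call above."""
    sc = spark.sparkContext
    jsc = spark._jsparkSession.sparkContext()
    jsc.requestTotalExecutors(total, 0, sc._jvm.PythonUtils.toScalaMap({}))

# The conversion helper can be checked with a stand-in for the py4j object:
class FakeSeq:
    def __init__(self, items):
        self._items = items
    def apply(self, i):
        return self._items[i]
    def length(self):
        return len(self._items)

print(java_string_seq_to_list(FakeSeq(["0", "1", "2"])))  # prints ['0', '1', '2']
```

Keeping the conversion separate from the JVM calls makes the fiddly py4j indexing (apply/length rather than Python iteration) the only part that needs a unit test.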