Understand Spark: Cluster Manager, Master and Driver nodes

Having read this question, I would like to ask additional questions:

  1. The Cluster Manager is a long-running service; on which node does it run?
  2. Is it possible for the Master and the Driver nodes to be the same machine? I presume there should be a rule somewhere stating that these two nodes should be different?
  3. In case the Driver node fails, who is responsible for re-launching the application? And what will happen exactly, i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
  4. Similarly to the previous question: in case the Master node fails, what will happen exactly and who is responsible for recovering from the failure?

1. The Cluster Manager is a long-running service; on which node does it run?

In Spark standalone mode, the Cluster Manager is the Master process. It can be started on any node with ./sbin/start-master.sh; in YARN, the equivalent is the ResourceManager.
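For example, a standalone cluster can be brought up by hand; the hostname and port below are placeholders (and note that in older Spark releases the worker script is named start-slave.sh rather than start-worker.sh):

```shell
# On the designated master node: start the Master
# (the standalone cluster manager)
./sbin/start-master.sh

# On each worker node: start a Worker and register it with the Master
./sbin/start-worker.sh spark://master-host:7077
```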

2. Is it possible for the Master and the Driver nodes to be the same machine? I presume there should be a rule somewhere stating that these two nodes should be different?

The Master is per cluster, and the Driver is per application. For standalone/YARN clusters, Spark currently supports two deploy modes.

  1. In client mode, the driver is launched in the same process as the client that submits the application.
  2. In cluster mode, however, the driver is launched from one of the Workers (standalone) or inside the Application Master (YARN), and the client process exits as soon as it fulfils its responsibility of submitting the application, without waiting for the app to finish.

If an application is submitted with --deploy-mode client on the Master node, both Master and Driver will be on the same node. See the deployment of a Spark application over YARN.
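As a sketch, the two deploy modes look like this with spark-submit; the master URL, class name and jar are placeholders:

```shell
# Client mode: the driver runs inside this spark-submit process,
# on whatever machine you run the command from
./bin/spark-submit --master spark://master-host:7077 \
  --deploy-mode client --class com.example.MyApp myapp.jar

# Cluster mode: the driver is launched on one of the Workers
# (standalone) or inside the YARN Application Master;
# spark-submit returns once the application has been handed off
./bin/spark-submit --master spark://master-host:7077 \
  --deploy-mode cluster --class com.example.MyApp myapp.jar
```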

3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly, i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?

If the driver fails, all executor tasks will be killed for that submitted/triggered Spark application.

4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?

Master node failures are handled in two ways.

  1. Standby Masters with ZooKeeper:

    Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master's state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications – applications that were already running during Master failover are unaffected. See here for configurations.

  2. Single-Node Recovery with Local File System:

    ZooKeeper is the best way to get production-level high availability, but if you just want to be able to restart the Master when it goes down, FILESYSTEM mode can take care of it. When applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master process. See here for the configuration and more details.
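As a sketch, both recovery modes are enabled through spark.deploy.* properties in conf/spark-env.sh on the Master node(s); the ZooKeeper quorum address and the recovery directory below are placeholders:

```shell
# (a) Standby Masters with ZooKeeper -- set on every Master candidate
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

# (b) Single-node recovery with the local file system -- set on the
# single Master; the directory must survive Master restarts
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
  -Dspark.deploy.recoveryDirectory=/var/spark/recovery"
```

Only one of the two settings should be active at a time, since the second assignment would overwrite the first.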

The Cluster Manager is a long-running service; on which node does it run?

A cluster manager is just a manager of resources, i.e. CPUs and RAM, that SchedulerBackends use to launch tasks. A cluster manager does nothing more for Apache Spark than offer resources, and once the Spark executors launch, they communicate directly with the driver to run tasks.

You can start a standalone master server by executing:

./sbin/start-master.sh

It can be started on any node.

To run an application on the Spark cluster:

./bin/spark-shell --master spark://IP:PORT

Is it possible for the Master and the Driver nodes to be the same machine? I presume there should be a rule somewhere stating that these two nodes should be different?

In standalone mode, when you start your machines, certain JVMs start: the Spark Master starts up, a Worker JVM starts on each machine, and the Workers register with the Spark Master. Together they form the resource manager. When you start your application, or submit it in cluster mode, a Driver starts up wherever you ran the command to launch that application. The Driver JVM contacts the Spark Master for executors (Ex), and in standalone mode the Worker starts the executors. So the Spark Master is per cluster and the Driver JVM is per application.

In case the Driver node fails, who is responsible for re-launching the application? And what will happen exactly, i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?

If an executor (Ex) JVM crashes, the Worker JVM will restart it, and when a Worker JVM crashes, the Spark Master will restart it. With a Spark standalone cluster in cluster deploy mode, you can also specify --supervise to make sure the driver is automatically restarted if it fails with a non-zero exit code; the Spark Master will then restart the Driver JVM.
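A sketch of such a supervised submission; the master URL, class name and jar are placeholders:

```shell
# Standalone cluster mode with driver supervision: the Master
# relaunches the driver if it exits with a non-zero code
./bin/spark-submit --master spark://master-host:7077 \
  --deploy-mode cluster --supervise \
  --class com.example.MyApp myapp.jar
```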

Similarly to the previous question: in case the Master node fails, what will happen exactly and who is responsible for recovering from the failure?

A failure of the Master will leave the executors unable to communicate with it, so they will stop working. A Master failure also makes the driver unable to communicate with it for job status, so your application will fail. Master loss will be acknowledged by the running applications, but otherwise they should continue to work more or less as if nothing happened, with two important exceptions:

1. The application won't be able to finish in an elegant way.

2. If the Spark Master is down, the Worker will try to reregisterWithMaster. If this fails multiple times, the Workers will simply give up.

reregisterWithMaster() re-registers with the active master this worker has been communicating with. If there is none, it means this worker is still bootstrapping and hasn't established a connection with a master yet, in which case it should re-register with all masters. It is important to re-register only with the active master during failures; if a worker unconditionally attempts to re-register with all masters, a race condition may arise. The error is detailed in SPARK-4592.

At this point, long-running applications won't be able to continue processing, but this still shouldn't result in immediate failure. Instead, the application will wait for a master to come back online (file-system recovery) or for contact from a new leader (ZooKeeper mode), and if that happens, it will continue processing.
