
What is the difference between Spark Standalone, YARN and local mode?

Spark Standalone:

In this mode, as I understand it, you run your master and worker nodes on your local machine.

Does that mean there is an instance of YARN running on my local machine? When I installed Spark it came bundled with Hadoop, and YARN usually ships with Hadoop as well, correct? And in this mode I can essentially simulate a smaller version of a full-blown cluster.

Spark Local Mode:

This is the part I am also confused about. To run in this mode I do val conf = new SparkConf().setMaster("local[2]").

In this mode, it doesn't use any type of resource manager (like YARN), correct? It simply runs the Spark job in the number of threads you provide to "local[2]"?

You are getting confused between Hadoop YARN and Spark.

YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications.

With the introduction of YARN, Hadoop has opened up to running other applications on the platform.

In short, YARN is a "pluggable data-parallel framework".

Apache Spark

Apache Spark is a batch, interactive, and streaming framework. Spark has a "pluggable persistent store" and can run with any persistence layer.

For Spark to run, it needs resources. In standalone mode you start the workers and the Spark master yourself, and the persistence layer can be anything: HDFS, a local file system, Cassandra, etc. In YARN mode you are asking the YARN/Hadoop cluster to manage the resource allocation and bookkeeping.

When you use local[2] as the master, you are asking Spark to use 2 cores and run the driver and workers in the same JVM. In local mode, all tasks related to the Spark job run in that same JVM.
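A rough way to picture what local[2] means is a fixed pool of two threads processing one task per partition inside a single JVM. The sketch below is plain Scala using a thread pool, not Spark's actual scheduler; the object and method names are made up for illustration:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object LocalModeSketch {
  // Roughly what "local[n]" means: one JVM, tasks scheduled on n threads.
  // Each partition of the data becomes one "task"; at most n run at once.
  def runLocal(nThreads: Int, partitions: Seq[Seq[Int]]): Int = {
    val pool = Executors.newFixedThreadPool(nThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val tasks = partitions.map(p => Future(p.sum)) // one task per partition
      Await.result(Future.sequence(tasks), Duration.Inf).sum
    } finally pool.shutdown()
  }
}

// e.g. three partitions, but only two of them ever run concurrently:
// LocalModeSketch.runLocal(2, Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8))) == 36
```

The point is only that the thread count caps concurrency; everything still shares one JVM's heap and one machine's cores.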

So the only difference between standalone and local mode is that in standalone you define "containers" for the workers and the Spark master to run in on your machine (so you can have 2 workers, and your tasks can be distributed across the JVMs of those two workers), but in local mode you are just running everything in a single JVM on your local machine.


Local mode
Think of local mode as executing a program on your laptop using a single JVM. It can be a Java, Scala or Python program in which you have defined and used a Spark context object, imported Spark libraries, and processed data residing on your system.


YARN
In reality, Spark programs are meant to process data stored across machines. Executors process the data stored on those machines, so we need a utility to monitor the executors and manage resources on those machines (the cluster). Hadoop has its own resource manager for this purpose, so when you run a Spark program against HDFS you can leverage Hadoop's resource-manager utility, i.e. YARN. The Hadoop properties are obtained from HADOOP_CONF_DIR, set inside spark-env.sh or your bash_profile.
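As a concrete illustration, a submission to YARN typically looks something like the following config fragment. The class name, jar, and config path are placeholders; the essential parts are --master yarn and a HADOOP_CONF_DIR that points at your cluster's Hadoop configuration:

```shell
# HADOOP_CONF_DIR (or YARN_CONF_DIR) must point at the Hadoop config directory
export HADOOP_CONF_DIR=/etc/hadoop/conf   # placeholder path

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \             # placeholder main class
  my-app.jar                              # placeholder jar
```

Note that with --master yarn there is no master host/port in the URL: Spark discovers the ResourceManager from the Hadoop configuration.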


Spark Standalone
The Spark distribution also comes with its own resource manager. When your program uses Spark's resource manager, the execution mode is called standalone. Moreover, Spark allows us to create a distributed master-slave architecture by configuring the properties files under the $SPARK_HOME/conf directory. By default it is set up as a single-node cluster, just like Hadoop's pseudo-distributed mode.
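For reference, bringing up a minimal standalone cluster and submitting to it usually looks like this config fragment (script names are from Spark 3.x, where start-worker.sh replaced the older start-slave.sh; the host, class and jar are placeholders):

```shell
# Start Spark's own master and one worker (standalone mode).
$SPARK_HOME/sbin/start-master.sh                         # master web UI on :8080 by default
$SPARK_HOME/sbin/start-worker.sh spark://localhost:7077  # worker registers with the master

# Submit against the standalone master instead of YARN or local:
spark-submit \
  --master spark://localhost:7077 \
  --class com.example.MyApp \                            # placeholder main class
  my-app.jar                                             # placeholder jar
```

Each worker you start is a separate JVM, which is exactly the difference from local mode described above.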

I would like to start with JVMs.

The JVM is something installed on your system. It is essentially an interpreter, accompanied by some mandatory libraries; JVM + libraries = JRE.

First, any number of JVMs can be started on your system, i.e. there can be many JVMs, each with its own allocated resources such as memory and cores. But they all share the interpreter and libraries, because there is only one copy of those.

Now, in local mode only one JVM takes care of the processing, and only one machine is used. You can specify how many cores you need, and those in turn become the cores of that JVM. Since you have n cores assigned to the local-mode JVM, you can do partitioning here as well. The maximum useful number of partitions for the data is the number of cores; beyond that there is no advantage.

Now, in standalone mode you create not one JVM but many, and Spark's own resource manager handles the scheduling across those JVMs. Since there are multiple JVMs, you can also place them on different machines; it is a cluster mode.

In cluster mode with a third-party resource manager like YARN or Mesos, you can do the same thing as in standalone mode.

