What is hadoop (single and multi) nodes, spark-master and spark-worker?

Original post: 2016-05-05 07:40:49 | Tags: apache-spark, hadoop, hdfs

I want to understand the following terms:

hadoop (single-node and multi-node)
spark master
spark worker
namenode
datanode

What I have understood so far is that the spark master is the job executor and handles all the spark workers, whereas hadoop is the hdfs (where our data resides), from which the spark workers read data according to the job given to them. Please correct me if I am wrong.

I also want to understand the roles of the namenode and the datanode. As far as I know, the namenode holds the metadata for all datanodes (preferably there is only one, but there could be two), while there can be multiple datanodes, and they hold the data.

Are datanodes the same as hadoop nodes?

2 answers

Spark architecture:

Spark uses a master/worker architecture. A driver talks to a single coordinator, called the master, which manages the workers in which executors run.

The driver and the executors run in their own Java processes. You can run them all on the same machine (horizontal cluster), on separate machines (vertical cluster), or in a mixed machine configuration.

Nodes are nothing but the physical machines.
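To make the division of roles above concrete, here is a toy model in plain Python. It is only a sketch and does not use Spark's API: the names `Master`, `Worker`, `Executor`, and `driver_run` are invented for illustration, and the round-robin placement is an assumption. The point it tries to show is that the master hands out executors on its workers, while the driver is the one that sends tasks to those executors.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Hypothetical stand-in for an executor process that runs tasks."""
    worker_name: str

    def run(self, task, partition):
        return task(partition)


@dataclass
class Worker:
    """Hypothetical worker node: it only hosts executors."""
    name: str

    def launch_executor(self):
        return Executor(worker_name=self.name)


@dataclass
class Master:
    """Hypothetical coordinator: tracks workers, runs no tasks itself."""
    workers: list = field(default_factory=list)

    def register(self, worker):
        self.workers.append(worker)

    def allocate_executors(self):
        # One executor per registered worker, to keep the sketch small.
        return [w.launch_executor() for w in self.workers]


def driver_run(master, task, partitions):
    """Toy driver: obtains executors from the master, then places tasks."""
    executors = master.allocate_executors()
    results = []
    for i, partition in enumerate(partitions):
        executor = executors[i % len(executors)]  # round-robin (assumed)
        results.append((executor.worker_name, executor.run(task, partition)))
    return results


master = Master()
master.register(Worker("worker-1"))
master.register(Worker("worker-2"))

# Three partitions of data, each summed on whichever executor it lands on.
placements = driver_run(master, sum, [[1, 2], [3, 4], [5, 6]])
for worker_name, value in placements:
    print(worker_name, value)
```

With two workers and three partitions, the first and third partitions land on `worker-1` and the second on `worker-2`. In this sketch the master never touches the data.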

Hadoop NameNode and DataNode:

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
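The split between metadata and data described above can be sketched in a few lines of plain Python. This is a toy model, not the HDFS client API: the classes `NameNode` and `DataNode`, the helpers `write_file` and `read_file`, the block id format, and the rotating replica placement are all assumptions made for illustration (real HDFS placement is rack-aware, and writes go through a DataNode pipeline).

```python
class DataNode:
    """Hypothetical DataNode: stores block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Hypothetical NameNode: keeps the block map, never the file bytes."""

    def __init__(self, datanode_names, replication):
        self.datanode_names = datanode_names
        self.replication = replication
        self.block_map = {}  # path -> list of (block id, [DataNode names])

    def allocate(self, path, num_blocks):
        n = len(self.datanode_names)
        layout = []
        for i in range(num_blocks):
            # Rotate the starting node so replicas land on distinct nodes.
            targets = [self.datanode_names[(i + r) % n]
                       for r in range(self.replication)]
            layout.append((f"{path}#blk{i}", targets))
        self.block_map[path] = layout
        return layout

    def locate(self, path):
        return self.block_map[path]


def write_file(namenode, datanodes, path, data, block_size):
    """Client side: ask the NameNode where blocks go, then write to DataNodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    layout = namenode.allocate(path, len(chunks))
    for (block_id, targets), chunk in zip(layout, chunks):
        for name in targets:
            datanodes[name].blocks[block_id] = chunk


def read_file(namenode, datanodes, path):
    """Client side: look up block locations, then read from the first replica."""
    return b"".join(
        datanodes[targets[0]].blocks[block_id]
        for block_id, targets in namenode.locate(path)
    )


datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
namenode = NameNode(list(datanodes), replication=2)

write_file(namenode, datanodes, "/logs/a.txt", b"0123456789", block_size=4)
layout = namenode.locate("/logs/a.txt")
restored = read_file(namenode, datanodes, "/logs/a.txt")
print(layout)
print(restored)
```

The 10-byte file becomes three blocks, each stored on two of the three DataNodes. The `NameNode` object only ever holds paths, block ids, and node names.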

Yes, DataNodes are the slave nodes in a Hadoop cluster.

Please refer to the documentation for more details.

Hadoop single-node: a Hadoop cluster with 1 NameNode (master) and 1 DataNode (slave). The NameNode holds all the metadata and assigns data blocks to the slave DataNodes, where the data is stored and the processing is done.

Hadoop multi-node: a Hadoop cluster with 1 NameNode (master) and n DataNodes (slaves).

spark master: plays a role similar to the NameNode in HDFS, in that it is the coordinating master of the cluster.

spark worker: similar to a DataNode, but a Spark worker is meant only for processing, not for storing data.

To put things in context (simplified): suppose a cluster has 1 NameNode and 2 DataNodes (1 GB memory each). A 2 GB file will be split and stored across the DataNodes. Similarly, a Spark job will be split so that this data is processed on the individual DataNodes (workers) in parallel.
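The arithmetic behind that example can be worked through in plain Python. The answer gives no block size, so this sketch assumes 128 MiB blocks, one replica per block, and alternating placement across the two DataNodes. The node names and the `process_local_blocks` helper are hypothetical, and a thread pool stands in for the two workers running in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

GIB = 1024 ** 3
MIB = 1024 ** 2

file_size = 2 * GIB
block_size = 128 * MIB  # assumed; the example above does not state one
datanodes = ["datanode-1", "datanode-2"]  # hypothetical node names

# Number of blocks, rounding up so a partial last block still counts.
num_blocks = (file_size + block_size - 1) // block_size

# Alternate blocks between the two DataNodes (assumed, one replica each).
placement = {node: [] for node in datanodes}
for block in range(num_blocks):
    placement[datanodes[block % len(datanodes)]].append(block)


def process_local_blocks(node):
    # Stand-in for real work: report how many bytes this node would scan.
    return node, len(placement[node]) * block_size


# Each worker handles only the blocks stored on its own node, in parallel.
with ThreadPoolExecutor(max_workers=len(datanodes)) as pool:
    scanned = dict(pool.map(process_local_blocks, datanodes))

print(num_blocks, "blocks")
for node, size in scanned.items():
    print(node, size // MIB, "MiB")
```

Under these assumptions the file becomes 16 blocks, 8 per DataNode, so each worker scans 1 GiB of locally stored data instead of one machine scanning all 2 GiB.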