Difference between using hdfs:// and yarn in Spark
In cluster mode, what is the difference between using hdfs:// and YARN in Spark when saving and loading files?
From your question here, it appears your understanding of HDFS and YARN is incorrect.
YARN is a generic job-scheduling framework, while HDFS is a storage framework.
In a nutshell, YARN has a master (the ResourceManager) and workers (NodeManagers). The ResourceManager creates containers on the workers to execute MapReduce jobs, Spark jobs, and so on.
HDFS, on the other hand, has a master (the NameNode) and workers (DataNodes) to persist and retrieve files.
You don't need YARN to communicate with HDFS; it is an independent entity.
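For example, you can read and write HDFS files with the command-line client alone, with no YARN daemon involved at all (the paths below are hypothetical):

```shell
# Copy a local file into HDFS; only the NameNode and DataNodes participate.
hdfs dfs -put ./data.csv /user/me/data.csv

# List and read it back -- again, no YARN component is involved.
hdfs dfs -ls /user/me
hdfs dfs -cat /user/me/data.csv
```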
In a production environment, the HDFS worker (DataNode) and the YARN worker (NodeManager) are installed on the same machine, so that the processing framework can consume data from the nearest local DataNode (data locality).
Using Spark on a YARN cluster in cluster mode means that one of the worker nodes within the YARN cluster acts as the client that submits the Spark job (the driver runs inside the cluster).
Hence, using hdfs:// obviously benefits the Spark job, as the Spark executors can read the data from the nearest DataNode.
The YARN and HDFS configurations are read from HADOOP_CONF_DIR on the client machine (which can be your local machine in client mode, or one of the worker nodes in cluster mode).
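Putting it together, a typical cluster-mode submission might look like the sketch below (the application script and paths are hypothetical; HADOOP_CONF_DIR is where Spark finds both the YARN and the HDFS configuration files):

```shell
# Point Spark at the cluster's Hadoop configuration
# (yarn-site.xml, core-site.xml, hdfs-site.xml).
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Submit in cluster mode: YARN schedules the driver and executors,
# while the hdfs:// URIs are resolved by HDFS, independently of YARN.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_job.py hdfs:///user/me/input hdfs:///user/me/output
```

Here YARN decides *where* the job runs, and hdfs:// decides *where the data lives*; they are configured together but answer different questions.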