
Difference between using hdfs:// and yarn in spark

In cluster mode, what is the difference between using hdfs:// and yarn when saving and loading files in Spark?

From your question, it seems your understanding of HDFS and YARN is not quite right.

YARN is a generic job-scheduling framework, and HDFS is a storage framework.

YARN, in a nutshell, has a master (Resource Manager) and workers (Node Managers).

The Resource Manager creates containers on the workers to execute MapReduce jobs, Spark jobs, etc.

HDFS, on the other hand, has a master (Name Node) and workers (Data Nodes) to persist and retrieve files.

You don't need YARN to communicate with HDFS; it is an independent entity.

In a production environment, the HDFS worker (Data Node) and the YARN worker (Node Manager) are installed on the same machine, so that the processing framework can consume data from the nearest local Data Node (data locality).
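To make data locality concrete, here is a minimal sketch (the NameNode address and file path are placeholders) that uses the Hadoop FileSystem API to ask which Data Nodes hold the blocks of a file; this is the information a scheduler can use to place tasks close to the data:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Placeholder NameNode address and file path, for illustration only.
    val conf = new Configuration()
    val fs   = FileSystem.get(new java.net.URI("hdfs://namenode:8020"), conf)

    // Ask the Name Node which Data Nodes store each block of the file.
    val status    = fs.getFileStatus(new Path("/data/input/part-00000"))
    val locations = fs.getFileBlockLocations(status, 0, status.getLen)
    locations.foreach(loc => println(loc.getHosts.mkString(", ")))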

Using Spark on a YARN cluster in cluster mode means that the Spark driver runs on one of the worker nodes within the YARN cluster, rather than on the machine that submitted the job.

Hence, using hdfs:// clearly benefits the Spark job, as the Spark executors can read the data from the nearest Data Node.
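As a minimal sketch (the NameNode host, port, and paths below are placeholders), saving and loading with explicit hdfs:// URIs looks like this:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hdfs-read-write-example")
      .getOrCreate()

    // Executors read the file blocks, preferring the closest Data Node.
    val df = spark.read.parquet("hdfs://namenode:8020/data/input")

    // Save the result back to HDFS.
    df.write.mode("overwrite").parquet("hdfs://namenode:8020/data/output")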

The YARN and HDFS configurations are read from HADOOP_CONF_DIR on the client machine (your local machine in client mode, or one of the worker nodes in cluster mode).
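Continuing the sketch above: if the core-site.xml under HADOOP_CONF_DIR sets fs.defaultFS to your HDFS Name Node (the hostname and path here are still placeholders), an unqualified path and a fully qualified hdfs:// URI resolve to the same location, which is why jobs on a properly configured cluster often omit the scheme:

    // With fs.defaultFS = hdfs://namenode:8020 in core-site.xml,
    // these two reads point at the same HDFS directory.
    val fromDefaultFs  = spark.read.parquet("/data/input")
    val fullyQualified = spark.read.parquet("hdfs://namenode:8020/data/input")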
