
What is the advantage of using Spark with HDFS as the file storage system and YARN as the resource manager?

I am trying to understand whether Spark is an alternative to the vanilla MapReduce approach for analyzing big data. Since Spark keeps its working data in memory, when it uses HDFS as the storage system, does it still take advantage of HDFS's distributed storage? For instance, suppose I have a 100 GB CSV file stored in HDFS and I want to run analysis on it. If I load it from HDFS into Spark, will Spark load the complete dataset into memory to do the transformations, or will it use the distributed storage environment that HDFS provides, the way MapReduce programs written in Hadoop leverage it? If not, then what is the advantage of running Spark over HDFS?

PS: I know Spark spills to disk if there is a RAM overflow, but does this spill happen per node of the cluster (say, 5 GB of data per node) or for the complete dataset (100 GB)?

Spark jobs can be configured to spill to local executor disk if there is not enough memory to read your files. Alternatively, you can enable HDFS snapshots and cache data between Spark stages.
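As a concrete illustration of the spill behavior (a sketch only — it assumes `pyspark` is installed, a running cluster, and an illustrative HDFS path):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("spill-example").getOrCreate()

# The file is read as distributed partitions; no executor ever holds
# the whole 100 GB. The path here is illustrative.
df = spark.read.csv("hdfs:///data/large.csv", header=True)

# MEMORY_AND_DISK keeps partitions in memory where they fit and spills
# the remainder to that executor's local disk instead of failing.
df.persist(StorageLevel.MEMORY_AND_DISK)

# Triggers evaluation; each executor processes only its own partitions,
# so any spill happens per node, not for the dataset as a whole.
print(df.count())
```

So to the PS in the question: spilling is a per-executor decision about that node's partitions, not an all-or-nothing event for the full dataset.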

You mention CSV, which is just a bad format to have in Hadoop in general. If you have 100 GB of CSV, you could easily end up with less than half that size if the same data were written in Parquet or ORC.
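Converting is a one-off job. A minimal sketch (again assuming `pyspark`, a reachable HDFS, and illustrative paths):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# inferSchema makes one extra pass over the data to assign real types
# instead of treating every column as a string.
df = spark.read.csv("hdfs:///data/large.csv", header=True, inferSchema=True)

# Parquet is columnar and compressed (Snappy by default), which typically
# shrinks text data considerably and lets later queries read only the
# columns they actually need.
df.write.mode("overwrite").parquet("hdfs:///data/large.parquet")
```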

At the end of the day, you need some processing engine and some storage layer. For example, Spark on Mesos or Kubernetes might work just as well as on YARN, but those are separate systems that are not bundled and tied together as nicely as HDFS and YARN. Plus, like MapReduce, when using YARN you move the execution to the NodeManagers running on the DataNodes, rather than pulling data over the network, which is what you would be doing with other Spark execution modes. The NameNode and ResourceManager coordinate this communication to decide where data is stored and processed.
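In practice that co-location comes for free when you submit to YARN in cluster mode. A sketch of such a submission (the script name, paths, and resource sizes are illustrative):

```shell
# Executors are launched on NodeManagers co-located with the HDFS
# DataNodes, so tasks are scheduled near the blocks they read.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 5 \
  --executor-memory 4g \
  my_job.py hdfs:///data/large.parquet
```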

If you are convinced that MapReduce v2 can be better than Spark, I would encourage you to look at Tez instead.
