
What is the difference between Map Reduce and Spark as execution engines in Hive?

It looks like there are two ways to use Spark as the backend engine for Hive.

The first is to use Spark directly as the execution engine, like this tutorial.

The other is to use Spark as the backend engine for MapReduce, like this tutorial.

In the first tutorial, hive.execution.engine is spark, and I cannot see HDFS involved.

In the second tutorial, hive.execution.engine is still mr, but since there is no Hadoop process, it looks like the backend of mr is Spark.
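For context, the engine discussed above is selected through the `hive.execution.engine` property. A minimal configuration sketch (the valid values here are the ones Hive documents; how your distribution wires the Spark backend may differ):

```sql
-- In the Hive CLI or Beeline, per session:
SET hive.execution.engine=spark;   -- alternatives: mr (deprecated), tez
```

The same property can also be set globally in hive-site.xml instead of per session.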

Honestly, I'm a little confused about this. I guess the first approach is the recommended one, since mr has been deprecated. But where is HDFS involved?

I understood it differently.

Normally Hive uses MR as the execution engine, unless you use IMPALA, but not all distros have it.

But for some time now, Spark can be used as the execution engine for Hive.

https://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/ discusses this in more detail.

Apache Spark builds a DAG (directed acyclic graph), whereas MapReduce works with native Map and Reduce. During execution in Spark, the logical dependencies are turned into physical dependencies.

Now, what is a DAG?

A DAG captures the logical dependencies between tasks before execution (think of it as a visual graph). When we have multiple map and reduce stages, or the output of one reduce is the input to another map, the DAG helps speed up the jobs. (The original answer included a diagram here.) A DAG is built in Tez (right side of the image) but not in MapReduce (left side).
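The idea of "logical dependencies built before execution" can be sketched with a toy stage graph. This is an illustration only, not Spark or Tez code; the stage names are hypothetical, and Python's standard-library `graphlib` stands in for the engine's DAG scheduler:

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph, matching the answer's example: two map
# stages feed a reduce, and the reduce's output is the input to
# another map. Each key maps to the set of stages it depends on.
stages = {
    "map1": set(),
    "map2": set(),
    "reduce1": {"map1", "map2"},   # reduce1 needs both map outputs
    "map3": {"reduce1"},           # reduce1's output feeds map3
}

# The scheduler derives a valid execution order from the DAG, so
# every stage runs only after all of its dependencies.
order = list(TopologicalSorter(stages).static_order())
print(order)
```

Because the whole graph is known up front, an engine can also see that `map1` and `map2` have no mutual dependency and run them in parallel, which is where the speed-up over stage-by-stage MapReduce jobs comes from.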

NOTE: Apache Spark works on a DAG but has stages in place of Map/Reduce, while Tez has a DAG and works on Map/Reduce. To keep things simple I used the Map/Reduce terminology, but remember that Apache Spark has stages. The concept of the DAG remains the same.

Reason 2: Map persists its output to disk (it buffers in memory too, but once the buffer is about 90% full the output spills to disk). From there the data goes to the merge phase. In Apache Spark, by contrast, intermediate data is kept in memory, which makes it faster. Check this link for details.
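The disk-versus-memory difference described above can be sketched in plain Python. This is a simplified analogy, not actual MapReduce or Spark internals; the pipeline functions and the doubling/summing stages are invented for illustration:

```python
import json
import os
import tempfile

# MapReduce-style: each stage materializes its output on disk,
# and the next stage reads it back in.
def mr_style_pipeline(records):
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "map_output.json")
        with open(path, "w") as f:           # map output spills to disk
            json.dump([x * 2 for x in records], f)
        with open(path) as f:                # next stage re-reads from disk
            doubled = json.load(f)
        return sum(doubled)                  # reduce

# Spark-style: intermediate results stay in memory as a chained,
# lazily evaluated computation; nothing is written out between stages.
def spark_style_pipeline(records):
    doubled = (x * 2 for x in records)       # in-memory, lazy
    return sum(doubled)

data = list(range(10))
print(mr_style_pipeline(data), spark_style_pipeline(data))
```

Both pipelines compute the same result; the MR-style version just pays for serialization and disk I/O between stages, which is the overhead the answer attributes to MapReduce.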
