
What is the difference between MapReduce and Spark as the execution engine in Hive?

It looks like there are two ways to use Spark as the backend engine for Hive.

The first is to use Spark directly as the engine, as in this tutorial.

The other is to use Spark as the backend engine for MapReduce, as in this tutorial.

In the first tutorial, hive.execution.engine is set to spark, and I cannot see HDFS involved anywhere.

In the second tutorial, hive.execution.engine is still mr, but since there is no Hadoop process running, it looks like the backend of mr is Spark.

Honestly, I'm a little confused about this. I guess the first approach is the recommended one, since mr has been deprecated. But where does HDFS come in?

I understood it differently.

Normally Hive uses MR as its execution engine, unless you query through Impala instead (a separate engine, which not all distros include).

For some time now, though, Spark can also be used as the execution engine for Hive.

https://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/ discusses this in more detail.

Apache Spark builds a DAG (directed acyclic graph), whereas MapReduce runs plain Map and Reduce phases. During execution in Spark, the logical dependencies are turned into physical dependencies.

Now, what is a DAG?

A DAG lays out the logical dependencies between steps before execution starts (think of it as a visual graph). When a query needs multiple map and reduce steps, or the output of one reduce is the input to another map, the DAG helps speed up the job. [image] Tez builds a DAG (right side of the image), whereas MapReduce does not (left side).

NOTE: Apache Spark works on a DAG but has stages in place of Map/Reduce. Tez has a DAG and works on Map/Reduce. To keep things simple I used the Map/Reduce context, but remember that Apache Spark has stages. The concept of the DAG remains the same.
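To make the "dependencies are known before execution" idea concrete, here is a small sketch in Python. It is not how Spark or Tez are implemented; the stage names and the plan itself are hypothetical, and it only uses the standard-library graphlib module (Python 3.9+) to show that once the graph is known, stages with no unmet dependencies can be grouped and scheduled together.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each stage maps to the set of stages it depends on.
plan = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}


def waves(dag):
    """Group stages into waves; every stage in a wave has all its inputs ready."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages whose dependencies are done
        result.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return result


for number, stages in enumerate(waves(plan), start=1):
    print(f"wave {number}: {', '.join(stages)}")
```

In this toy plan the two scans land in the same wave because neither depends on the other, which is the kind of information an engine only has if it sees the whole graph up front.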

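Reason 2: In MapReduce, each map task persists its output to disk. The output first goes into an in-memory buffer, but once the buffer passes its spill threshold (a configurable fraction, 80% by default in Hadoop) the contents are written to disk, and from there the data goes on to the merge step. In Apache Spark, intermediate data is kept in memory where possible, which makes it faster. Check this link for details.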
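The buffer-then-spill behaviour can be sketched with a toy model in Python. This is only an illustration, not Hadoop's code: the function name, the record sizes, and the buffer capacity are made up, and the spill fraction is passed in explicitly rather than read from any real configuration.

```python
def simulate_map_output(record_sizes, buffer_capacity, spill_fraction):
    """Toy model: count how often a map-side buffer is flushed to disk.

    Returns (number_of_spills, bytes_left_in_buffer).
    """
    threshold = buffer_capacity * spill_fraction
    used = 0
    spills = 0
    for size in record_sizes:
        used += size
        if used >= threshold:
            spills += 1  # buffer contents would be written to a spill file here
            used = 0     # buffer is free again for new records
    return spills, used


# Ten hypothetical records of 30 units each, a 100-unit buffer,
# and an assumed spill threshold of 80% of the buffer.
spills, leftover = simulate_map_output([30] * 10, 100, 0.8)
print(f"spills to disk: {spills}, still buffered: {leftover}")
```

Every spill in this model stands for a disk write that happens before the merge step; an engine that can keep the same intermediate data in memory avoids those writes.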

