
Differences in Execution between Hive and Spark

All: I am looking for someone with more knowledge to check my understanding of Hive and Spark.

I have been researching different large scale database solutions and I am trying to understand the difference in execution between Hive and Spark. I attempted to install Hadoop, Hive, and Spark to see how they perform. I was able to get Hadoop and Spark to work. I was unable to get Hive to work.

When I ran queries in Spark, after they passed through the optimizer it seemed that the biggest advantage is that only the relevant table data is read from the source at the earliest possible point. So if I only need Table1.columns(A, B, C) in the final answer, but tell the system to JOIN Table1 & Table2 on (Table1.A = Table2.B), it immediately reduces the carried table to only the relevant columns. I do not think Hive performs that way; I believe it will do the full join and perform the reduction later.
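A minimal PySpark sketch of this point (the paths and column names are made up): the optimized plan printed by explain() shows the scan restricted to the referenced columns before the join, which is the column pruning described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pruning-demo").getOrCreate()

# Hypothetical Parquet tables; only A, B, C from table1 (plus the join key
# from table2) are actually needed by the query below.
table1 = spark.read.parquet("/data/table1")
table2 = spark.read.parquet("/data/table2")

result = (table1.join(table2, table1["A"] == table2["B"])
                .select(table1["A"], table1["B"], table1["C"]))

# The extended plan shows the optimizer pruning the scans down to the
# referenced columns instead of carrying whole rows through the join.
result.explain(True)
```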

There are also differences in memory usage (Hive writing intermediate results back to HDFS frequently, vs. Spark keeping things in RAM). This has both advantages and disadvantages depending on the data set/query.
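To make the in-memory point concrete, here is a small sketch (hypothetical path and column names): once an intermediate result is cached, later actions reuse it from executor memory instead of re-reading HDFS.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("/data/events")

# Persist the filtered subset in memory (spilling to disk if it does not
# fit); both aggregations below then reuse the cached data.
recent = events.filter(events["year"] == 2020).cache()

recent.groupBy("user_id").count().show()
recent.agg({"amount": "sum"}).show()
```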

Unfortunately because I cannot get Hive to run, my theory is based off of reading outputs of other people running things in Hive.

Hive on Tez and Spark both use RAM (memory) for operating on data. The number of partitions computed, each of which is treated as an individual task, can be quite different between Hive on Tez and Spark. Hive on Tez by default tries to use a combiner to merge certain splits into a single partition. Hive on Tez also seems to handle autoscaling of clusters better than Spark and works most of the time; Spark does not cope well with autoscaling, as it tends to hit a lot of shuffle errors and fails when there are multiple stages. But given a fixed-size cluster, Spark seems to perform better than Hive on Tez, which can be attributed to some of its optimizations and to how shuffle, serialization, etc. are implemented.
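One way to see the "partitions become tasks" difference from the Spark side (a sketch with made-up dataset and values, not recommendations): Spark's shuffle parallelism is a fixed setting (200 by default) rather than something grouped adaptively the way Tez merges splits.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.shuffle.partitions", "400")  # tasks per shuffle stage
         .getOrCreate())

orders = spark.read.parquet("/data/orders")   # hypothetical dataset
totals = orders.groupBy("customer_id").sum("amount")

# Each of the 400 shuffle partitions becomes one task in the aggregation stage.
print(totals.rdd.getNumPartitions())
```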

I think Hive and Spark originally had different goals, and their execution styles are based on those goals.

Apache Spark is a framework that allows you to do calculations on big datasets stored on HDFS.

Hive is an SQL interface to retrieve data stored in HDFS and other clustered and object-store filesystems (S3 is an example) in a structured way.
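As an illustration of the "SQL interface" idea, a hedged sketch using the PyHive client against HiveServer2 (the host, port, and table name are assumptions, and this presumes a running Hive service):

```python
from pyhive import hive

# Connect to a HiveServer2 instance (assumed to be on localhost:10000)
# and run a plain SQL query over data that lives in HDFS or S3.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT a, b, c FROM table1 WHERE year = 2020 LIMIT 10")
for row in cursor.fetchall():
    print(row)
```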

Spark keeps things in RAM because it is more focused on making calculations with the data sets. Hive is more focused on retrieving data in a structured way, so it does not focus on speed as much (that being said, there have been improvements in Hive, like LLAP, that are meant to improve performance).

I like to use analogies with traditional software tools. On one side you have a relational database, and on the other side a programming language. They overlap in some functionality (you can write to and read from disk with the programming language, and you can do some calculations with the SQL engine). However, if the task at hand requires intensive and complex calculations, you would probably use the programming language. If you are looking for a system that lets you store data in a structured way, you would go for the SQL engine.
