
Why is Spark SQL CPU utilization higher than Hive's?

I am running the same query in both Hive and Spark SQL. We know that Spark is faster than Hive, so I got the expected response times.

But when we look at the CPU utilization:

  • the Spark process takes above 300%,
  • while Hive stays near 150% for the same query.

Is this the real nature of Spark and Hive?

  • What other metrics need to be considered?
  • How can both be evaluated in the right way?

The big picture

Spark has no superpowers. The source of its advantage over MapReduce is a preference for fast in-memory access over slower out-of-core processing that depends on distributed storage. So what it does at its core is cut down IO wait time.
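As a rough illustration, the sketch below uses the standard Spark SQL API to cache a DataFrame so that repeated actions scan memory instead of distributed storage. The input path `/data/events` and the column `event_type` are placeholders, not something from the original question.

```scala
// Minimal sketch of Spark's in-memory preference (e.g. in spark-shell).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cache-demo")
  .getOrCreate()

// Hypothetical input path on distributed storage.
val df = spark.read.parquet("/data/events")

// The first action materializes the data and pins it in memory;
// this pass is IO-bound, so average CPU utilization is low.
df.cache()
df.count()

// Subsequent actions scan the cached in-memory copy instead of disk,
// so the same work finishes faster at higher average CPU utilization.
df.groupBy("event_type").count().show()  // "event_type" is a placeholder column
```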

Conclusion

Higher average CPU utilization is expected. Let's say you want to compute the sum of N numbers. Independent of the implementation, the asymptotic number of operations will be the same. However, if the data is in memory, you can expect lower total time and higher average CPU usage, while if the data is on disk, you can expect higher total time and lower average CPU usage (higher IO wait).
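To make that concrete, here is a minimal plain-Scala sketch of the same sum computed once from memory and once from disk. The file `/tmp/numbers.txt` is hypothetical (one number per line), and the timings only illustrate the trend: the number of additions is identical, but the disk case spends most of its wall time waiting on IO, which is exactly what drags average CPU utilization down.

```scala
import scala.io.Source

object SumDemo {
  def main(args: Array[String]): Unit = {
    val n = 10000000

    // Case 1: data already in memory -- the CPU is busy nearly the whole time.
    val inMemory = Array.tabulate(n)(_.toLong)
    val t0 = System.nanoTime()
    val s1 = inMemory.sum                          // ~N additions, no IO wait
    val t1 = System.nanoTime()

    // Case 2: the same N numbers read from disk -- the same ~N additions,
    // but wall time is dominated by IO, so average CPU usage drops.
    val t2 = System.nanoTime()
    val src = Source.fromFile("/tmp/numbers.txt")  // hypothetical input file
    val s2 = src.getLines().map(_.toLong).sum
    src.close()
    val t3 = System.nanoTime()

    println(f"in-memory: sum=$s1, ${(t1 - t0) / 1e6}%.1f ms")
    println(f"from-disk: sum=$s2, ${(t3 - t2) / 1e6}%.1f ms")
  }
}
```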

Some remarks:

  • Spark and Hive are not designed with the same goals in mind. Spark is more of an ETL / streaming ETL tool, while Hive is a database / data warehouse. This implies different optimizations under the hood, and performance can differ significantly depending on the workload.

    Comparing resource usage without context doesn't make much sense.

  • In general, Spark is less conservative and more resource hungry. This reflects both the design goals and hardware evolution: Spark is a few years younger, and that is enough time to see a significant drop in hardware cost.
