Performance benchmarking between Hive (on Tez) and Spark for my particular use case

I'm playing around with some data on a cluster and want to do some aggregations. Nothing too complicated, but more than a plain sum: there are a few joins and count-distincts. I have implemented this aggregation in Hive and in Spark with Scala, and I want to compare the execution times.

When I submit the scripts from the gateway, the Linux `time` command gives me a real time smaller than the sys time, which I expected. But I'm not sure which one to pick for a proper comparison. Maybe just use sys time and run both queries several times? Is that acceptable, or am I completely off base here?

Real time. From a performance-benchmarking perspective, you only care about how long it takes (in human time) before your query completes and you can look at the results, not how many processes the application spins up internally.
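As a minimal sketch of that approach, you can record the wall-clock (real) seconds of several runs and compare distributions rather than a single measurement. Here `sleep 1` is a placeholder for the actual submission command (`hive -f ...` or `spark-submit ...`), which is an assumption of this sketch:

```shell
# Record the wall-clock (real) seconds of several runs of the same query.
# `sleep 1` is a placeholder for the real submission command.
results=""
for i in 1 2 3; do
  start=$(date +%s)
  sleep 1                          # replace with: spark-submit ... / hive -f ...
  end=$(date +%s)
  results="$results run$i=$((end - start))s"
done
echo "wall-clock times:$results"
```

Repeating the run like this and comparing the real times (not sys times) across runs smooths out cluster noise such as container startup and scheduling delays.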

Note: I would be very careful with performance benchmarking, as both Spark and Hive have plenty of tunable configuration knobs that greatly affect performance. See here for a few examples of altering Hive performance with vectorization, data format choices, data bucketing, and data sorting.
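For concreteness, a couple of the Hive knobs alluded to above (vectorized execution and stage parallelism) can be flipped per session. The values below are illustrative starting points to experiment with, not tuned recommendations:

```sql
-- Illustrative Hive session settings; measure before and after, don't copy blindly.
SET hive.vectorized.execution.enabled = true;          -- vectorized query execution
SET hive.vectorized.execution.reduce.enabled = true;   -- vectorization on the reduce side
SET hive.exec.parallel = true;                         -- run independent stages concurrently
```

Because a handful of settings like these can swing runtimes significantly, make sure both engines are reasonably configured before drawing conclusions from a benchmark.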

The "general consensus" is that Spark is faster than Hive on Tez, but that Hive handles huge data sets that don't fit in memory better. (I'm not going to cite a source since I'm lazy; do some googling.)
