
Performance benchmarking between Hive (on Tez) and Spark for my particular use case

I'm playing around with some data on a cluster and want to do some aggregations: nothing too complicated, but more involved than a plain sum, with a few joins and count distincts. I have implemented this aggregation both in Hive and in Spark with Scala, and I want to compare the execution times.

When I submit the scripts from the gateway, the Linux time command gives me a real time smaller than the sys time, which I expected. But I'm not sure which one to pick as a proper comparison. Maybe just use the sys time and run both queries several times? Is that acceptable, or am I a complete noob in this case?

Real time. From a performance benchmarking perspective, you only care about how long (in human time) it takes before your query completes and you can look at the results, not how many processes the application spins up internally.
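
For what that looks like in practice, here is a minimal sketch (in Scala, on the Spark side) of measuring real, i.e. wall-clock, time from inside the driver and repeating the query a few times. The table, columns and query below are made-up placeholders, not the asker's actual aggregation; the same wall-clock idea applies to the Hive-on-Tez run, for example by timing the whole hive invocation from the gateway shell.

```scala
import org.apache.spark.sql.SparkSession

// Minimal wall-clock benchmarking sketch; table and query are placeholders.
object AggregationBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aggregation-benchmark")
      .enableHiveSupport()
      .getOrCreate()

    // Wall-clock timing: elapsed human time, not CPU time summed across cores.
    def timeIt[T](body: => T): (T, Double) = {
      val start = System.nanoTime()
      val result = body
      val seconds = (System.nanoTime() - start) / 1e9
      (result, seconds)
    }

    val runs = 3
    val timings = (1 to runs).map { i =>
      val (_, seconds) = timeIt {
        spark.sql(
          """SELECT user_id, COUNT(DISTINCT item_id) AS items
            |FROM events
            |GROUP BY user_id""".stripMargin)
          .collect() // force execution; queries are lazy until an action runs
      }
      println(f"run $i: $seconds%.1f s")
      seconds
    }

    println(f"median over $runs runs: ${timings.sorted.apply(runs / 2)}%.1f s")
    spark.stop()
  }
}
```

The first run is usually slower because of executor start-up and cold caches, so report all runs (or the median) rather than a single measurement, and keep cluster load and configuration the same for both engines.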

Note, I would be very careful with performance benchmarking, as both Spark and Hive have plenty of tunable configuration knobs that greatly affect performance. See here for a few examples of altering Hive performance with vectorization, data format choices, data bucketing and data sorting.
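
As a concrete illustration of those knobs, here is a short HiveQL sketch; the property names are real Hive settings, but the values, table layout and bucket count are placeholder assumptions rather than recommendations, and their effect depends entirely on the data and cluster.

```sql
-- Engine and execution knobs (values are illustrative, not tuned):
SET hive.execution.engine=tez;                       -- run on Tez
SET hive.vectorized.execution.enabled=true;          -- vectorized operators
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;                            -- cost-based optimizer

-- Data layout knobs: columnar format plus bucketing/sorting on the join key.
-- Table and column names are placeholders.
CREATE TABLE events_orc (
  user_id BIGINT,
  item_id BIGINT
)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```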

The "general consensus" is that Spark is faster than Hive on Tez, but that Hive can handle huge data sets that don't fit in memory better. 普遍的共识是,Spark的速度比Hive on Tez快,但是Hive可以处理无法更好地存储在内存中的海量数据集。 (I'm not going to cite a source since I'm lazy, do some googling) (由于我很懒,所以我不会引用消息来源,请使用谷歌搜索)
