Hive query takes a long time just to launch the map-reduce job

Question

We are using Hive for ad-hoc querying and have a Hive table that is partitioned on two fields (date, id).

For each date there are around 1400 ids, so roughly that many partitions are added every day. The actual data resides in S3. The issue we are facing is that when we run a select count(*) over one month of the table, it takes a very long time (approx. 1 hr 52 min) just to launch the map-reduce job.

When I ran the query in Hive verbose mode, I could see that it spends this time deciding how many mappers to spawn (calculating splits). Is there any way to reduce this lag before the map-reduce job launches?

This is one of the log messages logged during this lag time:

13/11/19 07:11:06 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/19 07:11:06 WARN httpclient.RestS3Service: Response '/Analyze%2F2013%2F10%2F03%2F465' - Unexpected response code 404, expected 200

Answer 1

This is probably because, with an over-partitioned table, the query planning phase takes a long time. Worse, the query planning phase itself might take longer than the query execution phase.

One way to overcome this problem would be to tune your metastore. But the better solution would be to devise an efficient schema and get rid of unnecessary partitions. Trust me, you really don't want too many small partitions.

As an alternative, you could also try setting hive.input.format to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat before you issue your query.
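
A minimal sketch of applying that setting for a single session, assuming the query is run from the Hive CLI. The table name my_table and the date literals are hypothetical placeholders (the question does not give the table name or the format of the date partition values), so substitute your own:

```sql
-- Print the current value so it can be restored later if needed.
SET hive.input.format;

-- Ask Hive to combine small files into fewer splits (session-scoped).
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Hypothetical table name and date format; `date` is the partition
-- column from the question, quoted because it can clash with a keyword.
SELECT COUNT(*)
FROM my_table
WHERE `date` >= '2013-10-01' AND `date` <= '2013-10-31';
```

Whether this shortens the split calculation depends on the Hive version and on how the data is laid out in S3, so it is worth timing the query with and without the setting.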

HTH

Answer 1 (accepted) 2013-11-20 22:41:46
