Ways to reduce Hive query execution time

Question

We run the query below daily, and it takes about 3 hours because of the sheer volume of data in the transaction table. Is there any way to tune this query or reduce its execution time?

   CREATE TEMPORARY TABLE t1 AS
    SELECT DISTINCT EVENT_DATE FROM (
      SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
      WHERE load_date BETWEEN CAST(CAST('2019-03-05 04:00:31.0' AS TIMESTAMP) AS DATE) AND CURRENT_DATE() AND  event_title = 'SETUP'
      AND state != 'INACTIVE' AND mode != 'DORMANT') T

I tried reducing the number of reducers to speed it up, and also tried enabling vectorization, but without much luck. We are running on Tez.

Answer 1

Redesign the table and use indexes.

For example, I would use a numeric or enumerated 'state' column and a numeric or enumerated 'event' column. These make for more efficient indexes than varchar or text types.

Indexes dramatically speed up queries, provided the queries actually use them.

Anyway, without knowing the table structure or the number of records involved, I am just guessing...

Answer 2

You do not need to apply DISTINCT twice.
If the table mstr_wrk.cust_transation is partitioned by load_date, partition pruning will not work because the filter uses functions. This will cause a full table scan. Calculate the dates in a shell script and pass them in as parameters.

Check the performance of this query before parametrizing your script:

  CREATE TEMPORARY TABLE t1 AS
      SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
      WHERE load_date >= '2019-03-05' AND load_date <= '2019-03-07' 
            AND  event_title = 'SETUP'
            AND state != 'INACTIVE' AND mode != 'DORMANT'
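
Once the hard-coded version performs acceptably, the dates could be computed in a shell wrapper and handed to Hive as variables. The following is only a sketch under assumptions: the file name `distinct_event_dates.hql` is hypothetical, `load_date` is assumed to hold `YYYY-MM-DD` values, and the HQL file is assumed to reference the variables as `'${hivevar:start_date}'` and `'${hivevar:end_date}'`. When `hive` is not on the PATH, the script just prints the command it would run.

```shell
#!/bin/sh
# Sketch: compute literal date bounds outside Hive so that the predicate on
# load_date compares against constants instead of function calls.
set -eu

# Assumption: load_date is stored as 'YYYY-MM-DD'. Both values can be
# overridden from the environment, e.g. START_DATE=2019-03-01 ./run.sh
START_DATE="${START_DATE:-2019-03-05}"
END_DATE="${END_DATE:-$(date +%Y-%m-%d)}"

# Hypothetical HQL file holding the CREATE TEMPORARY TABLE statement (and
# whatever later uses t1 in the same session). Its filter would read:
#   WHERE load_date >= '${hivevar:start_date}'
#     AND load_date <= '${hivevar:end_date}'
HQL_FILE="${HQL_FILE:-distinct_event_dates.hql}"

if command -v hive >/dev/null 2>&1; then
  hive --hivevar start_date="$START_DATE" \
       --hivevar end_date="$END_DATE" \
       -f "$HQL_FILE"
else
  # Dry run: show the command when the Hive CLI is not installed here.
  echo "hive --hivevar start_date=$START_DATE --hivevar end_date=$END_DATE -f $HQL_FILE"
fi
```

Keeping the statements that use t1 in the same HQL file matters because a temporary table only lives for the session that created it.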

2 answers

Answer 1: 0, 2019-03-07 14:19:55

Answer 2: 0, accepted, 2019-03-07 14:47:39
