Approach to reduce the execution time of a Hive query
We run the query below daily, and it takes about 3 hours because of the sheer volume of data in the transaction table. Is there any way we can tune this query or reduce the execution time?
CREATE TEMPORARY TABLE t1 AS
SELECT DISTINCT EVENT_DATE FROM (
SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
WHERE load_date BETWEEN CAST(CAST('2019-03-05 04:00:31.0' AS TIMESTAMP) AS DATE) AND CURRENT_DATE() AND event_title = 'SETUP'
AND state != 'INACTIVE' AND mode != 'DORMANT') T
I tried reducing the number of reducers to speed things up, and also tried enabling vectorization, but without much luck. We are running on Tez.
Redesign the table and use indexes.
For example, I would use a numeric or enumerated 'state' column and a numeric or enumerated 'event' column instead of varchar or text types. This helps build efficient indexes.
Indexes dramatically speed up queries, provided the queries actually use them.
That said, without knowing the table structure or the number of records involved, I am just guessing...
If mstr_wrk.cust_transation is partitioned by load_date, partition pruning will not work because you are using functions in the filter. This will cause a full table scan. Check the performance of this script before parametrizing your script:
CREATE TEMPORARY TABLE t1 AS
SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
WHERE load_date >= '2019-03-05' AND load_date <= '2019-03-07'
AND event_title = 'SETUP'
AND state != 'INACTIVE' AND mode != 'DORMANT'
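
One way to check whether pruning actually happens is sketched below. EXPLAIN DEPENDENCY reports the tables and partitions a query would read, so running it on the literal-date version shows whether only the expected load_date partitions are listed. The second statement is a possible parametrized form; it assumes the script is launched with two variables passed via --hivevar. The names start_date and end_date are hypothetical and do not come from the original post.

```sql
-- List the tables and partitions the query would read, without running it.
-- If pruning works, input_partitions should contain only the load_date
-- partitions inside the requested range.
EXPLAIN DEPENDENCY
SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
WHERE load_date >= '2019-03-05' AND load_date <= '2019-03-07'
AND event_title = 'SETUP'
AND state != 'INACTIVE' AND mode != 'DORMANT';

-- Parametrized variant. start_date and end_date are hypothetical variable
-- names supplied by the caller, for example:
--   hive --hivevar start_date=2019-03-05 --hivevar end_date=2019-03-07 -f script.hql
CREATE TEMPORARY TABLE t1 AS
SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
WHERE load_date >= '${hivevar:start_date}' AND load_date <= '${hivevar:end_date}'
AND event_title = 'SETUP'
AND state != 'INACTIVE' AND mode != 'DORMANT';
```

Substituting the dates before Hive parses the query keeps the partition filter a plain literal comparison, which is the form suggested for testing above.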