简体   繁体   中英

Approach to reduce the execution time of a Hive query

We run this below query daily and this query runs for 3 hours or so, owing due to sheer volume of data in the transaction table. Is there any way we can tune this query or reduce the execution time?

   CREATE TEMPORARY TABLE t1 AS
    SELECT DISTINCT EVENT_DATE FROM (
      SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
      WHERE load_date BETWEEN CAST(CAST('2019-03-05 04:00:31.0' AS TIMESTAMP) AS DATE) AND CURRENT_DATE() AND  event_title = 'SETUP'
      AND state != 'INACTIVE' AND mode != 'DORMANT') T

I tried to reduce the number of reducers to help speed up, and also tried to enable vectorization but not much luck here. We are running on tez.

Redesign table and use INDEXes.

For example I would use a numeric 'state' column or enumerative and also a numeric or enumerative 'event' column. This can help to make efficient indexes, instead of varchar or text types.

Indexes dramatically improve speed queries up if queries are using them.

Anyway not knowing the table structure and the number of the records involved, I am just guessing...

  1. You do not need to apply DISTINCT two times
  2. If table mstr_wrk.cust_transation is partitioned by load_date , partition pruning will not work because you are using functions. This will cause table full scan. Calculate dates in the shell script and pass as a parameters

Check this script performance before parametrizing your script

  CREATE TEMPORARY TABLE t1 AS
      SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
      WHERE load_date >= '2019-03-05' AND load_date <= '2019-03-07' 
            AND  event_title = 'SETUP'
            AND state != 'INACTIVE' AND mode != 'DORMANT'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM