简体   繁体   中英

Hive query running infinite time on Tez

Below is the query which I am trying to run on Hive with execution engine as tez.

SELECT A.CITY, A.NAME, B.PRICE
  ,(ROW_NUMBER() OVER (PARTITION BY A.NAME ORDER BY B.PRICE) ) AS RNUM
  FROM TABLE1 A
  LEFT JOIN TABLE2 B 
   ON A.NAME = B.NAME 
   WHERE ( A.COLUMN2 >= B.COLUMN3 AND A.COLUMN2 < B.COLUMN4)
GROUP BY A.CITY, A.NAME, B.PRICE;
  • When I run above query on my data in Hive it keeps running for hours without any results, but syntactically above query is correct.
  • Both TABLEA and TABLEB has millions of records.

I tried changing the data format, increasing the container size, changing the number of reducers and changing heap size. Whatever parameter I change the query is being stuck.

In my further investigation I noticed that the where condition and window function are causing the query to run infinitely.

Here is my question:

  • What is the change that will make the query run end to end on huge data?

Thanks for your help

I guess this is not caused by memory allocation or reducer count. Could be caused by data skew. Analyze from that point also. This link would help: https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

The link below gives insights on skewed tables and list bucketing. Don't miss to read.

https://cwiki.apache.org/confluence/display/Hive/ListBucketing

Thanks!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM