
Converting a subquery to a single query in Hive

I have a query that needs a count of colA grouped by colB and by a particular average of colC. For example:

SELECT COUNT( X.colA ), X.colB , X.MEASURE
FROM (
  SELECT colA  , colB  , avg(colC) MEASURE
  FROM tableA
  GROUP BY colA, colB
  HAVING round(avg(colC),2) > 0
) X 
GROUP BY X.MEASURE , X.colB
HAVING X.MEASURE BETWEEN 0 AND 3000
ORDER BY MEASURE

An example result could be:

No of User, URL    , average time spent
90182     , abc.com,    334
293556    , def.com,     33

The problem with the above query is that, because of the subquery, the inner query shuffles a huge amount of intermediate data to the outer query, which makes the query very slow on large data sets.

Is there a way to convert the above query into one without a subquery, or is there a UDAF available, so that there is no major shuffle of intermediate data and it runs in a single stage?

I don't see an easy way to simplify the query, but moving the outer HAVING condition into the subquery might help performance a bit:

SELECT COUNT( X.colA ), X.colB , X.MEASURE
FROM (SELECT colA  , colB  , avg(colC) MEASURE
      FROM tableA
      GROUP BY colA, colB
      HAVING round(avg(colC),2) > 0 and avg(colC) <= 3000
     ) X 
GROUP BY X.MEASURE , X.colB
ORDER BY MEASURE;

You want to aggregate by the average of a set of rows. This seems to require two operations -- one to calculate the average and the other for the final aggregation.
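
One way to check how many stages a given form needs is Hive's EXPLAIN, which prints the stage dependencies and the plan without running the query. A minimal sketch, using the rewritten query above (the table and column names are the ones from the question):

```sql
-- Prints STAGE DEPENDENCIES and STAGE PLANS instead of executing the query.
EXPLAIN
SELECT COUNT(X.colA), X.colB, X.MEASURE
FROM (SELECT colA, colB, avg(colC) MEASURE
      FROM tableA
      GROUP BY colA, colB
      HAVING round(avg(colC), 2) > 0 AND avg(colC) <= 3000
     ) X
GROUP BY X.MEASURE, X.colB
ORDER BY MEASURE;
```

Comparing this output with the plan for a candidate rewrite shows whether a shuffle stage was actually removed.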

Actually, if I think about it, the following might do what you want:

select colB, count(distinct colA), sum(colC) / count(distinct colA) as measure
from tableA
group by colB
having sum(colC) / count(distinct colA) between 0 and 3000
order by measure;

It is not exactly the same, but I don't understand the purpose of grouping by measure in the outer query. Perhaps a summary with one row for each colB value would be sufficient.
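
To make the difference concrete, here is a small sketch on made-up data. The rows are hypothetical, the table and column names follow the question, and the single-level query assumes the intended grouping is by colB. It should run on Hive 0.14 or later, where INSERT ... VALUES is available.

```sql
-- Hypothetical toy data: three users (colA) on one URL (colB).
CREATE TABLE tableA (colA STRING, colB STRING, colC DOUBLE);

INSERT INTO TABLE tableA VALUES
  ('u1', 'abc.com', 100.0),
  ('u1', 'abc.com', 300.0),
  ('u2', 'abc.com', 200.0),
  ('u3', 'abc.com', 50.0);

-- Two-level form: average per (colA, colB) first,
-- then count the users that share the same average.
-- Per-user averages here: u1 = 200, u2 = 200, u3 = 50.
SELECT COUNT(X.colA) AS num_users, X.colB, X.MEASURE
FROM (SELECT colA, colB, avg(colC) AS MEASURE
      FROM tableA
      GROUP BY colA, colB
      HAVING round(avg(colC), 2) > 0 AND avg(colC) <= 3000
     ) X
GROUP BY X.MEASURE, X.colB
ORDER BY MEASURE;
-- Two rows: (1, abc.com, 50.0) and (2, abc.com, 200.0)

-- Single-level form: one row per colB value.
SELECT colB,
       count(DISTINCT colA) AS num_users,
       sum(colC) / count(DISTINCT colA) AS measure
FROM tableA
GROUP BY colB
HAVING sum(colC) / count(DISTINCT colA) BETWEEN 0 AND 3000
ORDER BY measure;
-- One row: abc.com, 3 users, measure = 650 / 3 (about 216.67)
```

Note that the single-level measure is the total of colC divided by the number of distinct users, which is the mean of the per-user totals, not of the per-user averages. Whether that is an acceptable substitute depends on what the report is meant to show.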


 