Hive Query Efficiency

Question

Could you help me with a Hive query efficiency problem? I have two queries that solve the same problem, and I cannot figure out why one is much faster than the other. If you know, please feel free to share your insight. Any information is welcome!

Problem: I am trying to find the minimum value of each of a number of columns (v1 through v200) in a Hive Parquet table.

Queries: I tried two queries, as follows:

query 1

drop table if exists tb_1 purge;
create table if not exists tb_1 as
select 'v1' as name, min(v1) as min_value from src_tb union all
select 'v2' as name, min(v2) as min_value from src_tb union all
select 'v3' as name, min(v3) as min_value from src_tb union all
...
select 'v200' as name, min(v200) as min_value from src_tb
;

query 2

drop table if exists tb_2 purge;
create table if not exists tb_2 as
select min(v1) as min_v1
, min(v2) as min_v2
, min(v3) as min_v3
...
, min(v200) as min_v200
from src_tb
;

Result: Query 2 is much faster than query 1. The second query took about 5 minutes to finish. I don't know how long query 1 would take. Normally, after I submit a query, the system starts analyzing it and prints compilation information in the terminal. After I submitted the first query, however, the system did not respond at all for a long time, so I killed it.

What do you think? Thank you in advance.

Answer 1

Query execution time depends on the environment in which you execute it.

In MSSQL

It is tempting to think that query execution follows the algorithms described in theoretical resources, but in practice it depends on other things.

For example, both of your queries contain SELECT statements that run against a table, and at first glance they need to read all rows. However, the database server must analyze the statement to determine the most efficient way to extract the requested data. This is referred to as optimizing the SELECT statement. The component that does this is called the Query Optimizer. The input to the Query Optimizer consists of the query, the database schema (table and index definitions), and the database statistics. The output of the Query Optimizer is a query execution plan, sometimes referred to as a query plan or just a plan. (See the query-processing architecture documentation for more information.)

You can view the execution plan for a query in MSSQL, and I think you will understand the difference better by looking at the execution plans for both of your queries.

Edit (Hive)

Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query

A Hive query gets converted into a sequence of stages. The description of the stages itself shows a sequence of operators with the metadata associated with the operators.

Please see LanguageManual Explain for more information.
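
As a rough sketch of how this applies to the question, EXPLAIN can be prefixed to trimmed-down versions of both queries so that the plans can be compared. Only v1 and v2 are shown here; the column list is shortened purely for illustration, and the exact stages printed depend on the Hive version and execution engine.

```sql
-- Plan for the UNION ALL form (query 1), trimmed to two columns for illustration.
EXPLAIN
select 'v1' as name, min(v1) as min_value from src_tb union all
select 'v2' as name, min(v2) as min_value from src_tb;

-- Plan for the single-SELECT form (query 2), trimmed the same way.
EXPLAIN
select min(v1) as min_v1
, min(v2) as min_v2
from src_tb;
```

Counting how many TableScan operators on src_tb appear in each plan is one way to check how often the table would be read.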

Answer 2

What is surprising? The first query has to read src_tb a total of 200 times, once per SELECT in the UNION ALL. The second reads the data once and performs 200 aggregations. It is a no-brainer that the second is faster.
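
If the long (name, min_value) layout of query 1 is still wanted, one possible approach, which is not from the original answers, is to compute all the minimums in a single pass and then unpivot the one-row result. The sketch below uses Hive's built-in stack() table function with LATERAL VIEW. It is trimmed to three columns and assumes v1 through v3 share a data type; the table name tb_1_long and the aliases m and u are hypothetical.

```sql
-- Hypothetical table name; trimmed to three columns for illustration.
drop table if exists tb_1_long purge;
create table tb_1_long as
select u.name, u.min_value
from (
  -- single scan of src_tb, as in query 2
  select min(v1) as min_v1
  , min(v2) as min_v2
  , min(v3) as min_v3
  from src_tb
) m
-- turn the one wide row into one (name, min_value) row per column
lateral view stack(3,
  'v1', m.min_v1,
  'v2', m.min_v2,
  'v3', m.min_v3
) u as name, min_value;
```

The unpivot step operates on a single row, so its cost should be negligible next to the scan of src_tb.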

2 answers

solution1
4 ACCEPTED 2018-03-07 19:41:01

solution2
1 2018-02-28 22:53:47