Why does a Hive query over partition info (supposed to be stored in the metastore) take so long?

Question

I have an external table table1 in HDFS with a single partition column, column1, of type string, and I am using Hive to query it.

The following query finishes in 1 second, as expected, because the data is present in the Hive metastore itself.

SHOW PARTITIONS table1;

The result of the above command also confirms that all partitions are present in the metastore. I have also run MSCK REPAIR TABLE table1 to make sure all partition info is there. But the query below takes 10 minutes to complete.

SELECT min(column1) from table1;

Why does this query run full MapReduce tasks just to determine the minimum value of the partition column column1 when all the values are already present in the metastore?

There is one more use case where Hive scans the full table data instead of using the partition information:

SELECT * FROM (SELECT * FROM table1 WHERE column1='abc') q1 INNER JOIN (SELECT * FROM table1 WHERE column1='xyz') q2 ON q1.column2==q2.column2;

In such queries too, Hive does not make use of the partition info and scans all partitions, including unrelated ones such as column1='jkl'.

Any pointers about this behaviour? I am not sure whether the two scenarios above have the same cause.

Answer 1

It's because of the way the data is stored and accessed.

SHOW PARTITIONS table1; takes 1 second because this data comes straight from the metastore's metadata tables.
SELECT min(column1) from table1; takes minutes because this data comes from HDFS and is calculated only after Hive goes through all the actual data.
To test this, run EXPLAIN on SELECT min(column1) from table1; and you will see that the query goes through all the partitions (and all the data) and then finds the min value. This is the same as checking all the data to find the min value. Please note that a partition isn't an index; partitions are separate physical folders that hold the data files for quicker access.
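The check described above can be sketched as follows, using the table and column names from the question. EXPLAIN EXTENDED is the form that prints the Path -> Alias and Path -> Partition sections, and EXPLAIN DEPENDENCY lists the input tables and partitions as JSON. The exact output layout varies between Hive versions, so treat this as a sketch rather than as expected output.

```sql
-- Full plan, including the list of partition paths the query will read.
EXPLAIN EXTENDED
SELECT min(column1) FROM table1;

-- Compact alternative: prints the input tables and partitions as JSON.
EXPLAIN DEPENDENCY
SELECT min(column1) FROM table1;

-- The same check on the join from the question: if partition pruning works,
-- only the column1=abc and column1=xyz partitions should be listed as inputs.
EXPLAIN DEPENDENCY
SELECT *
FROM (SELECT * FROM table1 WHERE column1 = 'abc') q1
INNER JOIN (SELECT * FROM table1 WHERE column1 = 'xyz') q2
  ON q1.column2 = q2.column2;
```

If the join plan lists partitions other than the two that were filtered on, pruning is not being applied to that query, which would make it a separate problem from the min() case.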

If you run EXPLAIN on the query, you will see that it accesses every partition in the min() case (I created partitions on a random college_marks column):

      Path -> Alias:
        hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=10.0 [tmp]
        hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=50.0 [tmp]
      Path -> Partition:
        hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=10.0 
          Partition
            base file name: college_marks=10.0
            input format: org.apache.hadoop.mapred.TextInputFormat
            ...
        hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=50.0 
          Partition
            base file name: college_marks=50.0
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            partition values:
              college_marks 50.0
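Every partition directory appears as an input here because the aggregate is planned as an ordinary table scan. Hive does have a metadata-only optimization, controlled by hive.optimize.metadataonly, aimed at queries that touch only partition columns with distinct-like aggregates such as min and max. Whether it is on by default, and whether it actually applies to this table, depends on the Hive version and execution engine, so the following is a sketch to experiment with, not a guaranteed fix. One caveat to keep in mind: a partition that is registered in the metastore but contains no rows may then contribute to the result.

```sql
-- Session-level switch; the default differs between Hive versions.
SET hive.optimize.metadataonly=true;

-- Re-check the plan and compare it with the plan produced with the setting off.
EXPLAIN EXTENDED
SELECT min(column1) FROM table1;

-- Then time the query itself.
SELECT min(column1) FROM table1;
```

If the plan and the run time do not change, the optimization is not being applied in your setup, and the full scan described above is what you get.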
