I have an external table table1 created in HDFS, containing a single partition column column1 of type string, and I am using Hive to get data from it.
The following query finishes in 1 second, as expected, since the data is present in the Hive metastore itself.
SHOW PARTITIONS table1;
The result of the above command also confirms that all partitions are present in the metastore. I have also run MSCK REPAIR TABLE table1 to make sure all partition info is present in the metastore. But the query below takes 10 minutes to complete.
SELECT min(column1) from table1;
Why is this query running full MapReduce tasks just to determine the minimum value of the partition column column1 when all the values are already present in the metastore?
There is one more use case where Hive scans the full table data and does not make use of partition information.
SELECT * FROM (SELECT * FROM table1 WHERE column1='abc') q1 INNER JOIN (SELECT * FROM table1 WHERE column1='xyz') q2 ON q1.column2==q2.column2
In such queries too, Hive does not make use of the partition info and scans all partitions, including unrelated ones such as column1='jkl'.
Any pointers about this behaviour? I am not sure whether the above two scenarios have the same cause.
It's because of the way the data is stored and accessed.
SHOW PARTITIONS table1;
takes 1 second because this data comes straight from the metadata table.
SELECT min(column1) from table1;
takes minutes because this data comes from HDFS and is calculated only after Hive goes through all the actual data. If you run
explain SELECT min(column1) from table1;
you will see that the query goes through all the partitions (and all the data) and then finds the min value. This is as good as checking all the data to find the min value. Please note that a partition isn't an index; partitions are separate physical folders that store the data files for quicker access.
If you run explain on the SQL, you will see that it accesses all partitions in the case of the min() query (I created partitions on a random college_marks column):
Path -> Alias:
hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=10.0 [tmp]
hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=50.0 [tmp]
Path -> Partition:
hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=10.0
Partition
base file name: college_marks=10.0
input format: org.apache.hadoop.mapred.TextInputFormat
hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=50.0
Partition
base file name: college_marks=50.0
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
partition values:
college_marks 50.0
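A shorter way to see which partitions a query would read, without digging through the full plan, may be EXPLAIN DEPENDENCY. The sketch below is only an illustration: it assumes the table1, column1 names from the question and a Hive version that supports EXPLAIN DEPENDENCY, and I have not run it against the asker's setup.

```sql
-- Prints a JSON document listing the tables and partitions the query depends on,
-- without executing the query itself.
EXPLAIN DEPENDENCY
SELECT min(column1) FROM table1;

-- Same check for one side of the join from the question.
-- If partition pruning is applied, only the column1=abc partition should be listed.
EXPLAIN DEPENDENCY
SELECT * FROM table1 WHERE column1 = 'abc';
```

If the input_partitions list for the first statement contains every partition of table1, that matches the plan above: Hive is planning to read the data files of all partitions to compute min().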