简体   繁体   中英

hive not using partition to select data in external table

A weird behaviour I never saw.

A have parquet files in my HDFS generated by Spark, split into 3 partitions.

Here is the count for all partitions. It was calculated by Spark .

+----------+--------+
|created_at|count(1)|
+----------+--------+
|  20190101|12774895|
|  20181231|18648432|
|  20190102|30010065|
+----------+--------+

Total: 61433392 records

Hive

Created a table.

CREATE EXTERNAL TABLE raw.event (
    account_type STRING,
    event_name STRING,
    payload MAP<STRING, STRING>
)
PARTITIONED BY(created_at INT, product_name STRING, event_type STRING)
STORED AS PARQUET LOCATION '/datalake/raw/event'
TBLPROPERTIES('PARQUET.COMPRESS'='SNAPPY');

Added all partitions.

show partitions raw.event;

+---------------------------------------------------------------------+--+
|                              partition                              |
+---------------------------------------------------------------------+--+
| created_at=20181231/product_name=A/event_type=X  |
| created_at=20190101/product_name=A/event_type=X  |
| created_at=20190102/product_name=A/event_type=X  |
+---------------------------------------------------------------------+--+

Execute a full count to make sure everything run smoothly:

0: jdbc:hive2://headnodehost:10001/> select count(*) from raw.event;

+-----------+--+
|    _c0    |
+-----------+--+
| 61433392  |
+-----------+--+

Expected result! Ok :)

Now count just from one partition.

0: jdbc:hive2://headnodehost:10001/> select count(*) from raw.event where created_at=20190102 and product_name='A' and event_type='X';

Instead get the expected 12774895 count rows I am still getting the full count.

+-----------+--+
|    _c0    |
+-----------+--+
| 61433392  |
+-----------+--+

Now I am trying to count and group by the created_at partition.

0: jdbc:hive2://headnodehost:10001/> SELECT created_at, count(*) FROM raw.event group by created_at;

+-------------+-----------+--+
| created_at  |    _c1    |
+-------------+-----------+--+
| 20190102    | 61433392  |
+-------------+-----------+--+

I am still getting the full count of rows, and just the last added partition.

I am pretty sure the parquet files contains different created_at values.

Avoid using stats, also does not help.

set hive.compute.query.using.stats=false;

Analyse table also does not help:

0: jdbc:hive2://headnodehost:10001/> ANALYZE TABLE raw.event PARTITION(created_at, product_name, event_type) COMPUTE STATISTICS;

INFO  : Partition raw.event{created_at=20181231, product_name=A, event_type=X} stats: [numFiles=111, numRows=**61433392**, totalSize=19637211108, rawDataSize=1965868544]
INFO  : Partition raw.event{created_at=20190101, product_name=A, event_type=X} stats: [numFiles=111, numRows=0, totalSize=19637211108, rawDataSize=0]
INFO  : Partition raw.event{created_at=20190102, product_name=A, event_type=X} stats: [numFiles=111, numRows=0, totalSize=19637211108, rawDataSize=0]

Forcing a partition brings the same result.

0: jdbc:hive2://headnodehost:10001/> ANALYZE TABLE raw.event PARTITION(created_at=20190102, product_name, event_type) COMPUTE STATISTICS;

INFO  : Partition raw.event{created_at=20190102, product_name=A, event_type=X} stats: [numFiles=111, numRows=**61433392**, totalSize=19637211108, rawDataSize=1965868544]

Any suggestion?

This is probably related to how you added the partitions. It looks like all your data is under the same directory in the HDFS. Partitions point to a specific location in the hdfs (not multiple locations), so I would assume 1 partition is pointing to the whole dataset. You can run the following command to verify partition location

describe formatted raw.event  partition ( created_at=20181231,product_name="A",event_type="X");

Also, partitions do not depend on the data itself or even directory, location and value depend on how you added them to the table

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM