
Hive: dynamic partition adding to external table

I am running Hive 0.7.1, processing existing data which has the following directory layout:
- TableName
  - d=2011-08-01
  - d=2011-08-02
  - d=2011-08-03
  - ... etc
Under each date directory I have the data files.
Now, to load the data, I'm using:

CREATE EXTERNAL TABLE table_name (i INT)
PARTITIONED BY (d STRING)
LOCATION '${hiveconf:basepath}/TableName';

I would like my Hive script to be able to load the relevant partitions according to some input date and a number of days. So if I pass date='2011-08-03' and days='7', the script should load the following partitions:

- d=2011-08-03
- d=2011-08-04
- d=2011-08-05
- d=2011-08-06
- d=2011-08-07
- d=2011-08-08
- d=2011-08-09

I haven't found any decent way to do it except explicitly running:

ALTER TABLE table_name ADD PARTITION (d='2011-08-03');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-04');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-05');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-06');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-07');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-08');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-09');  

and then running my query

select count(1) from table_name;

However, this is of course not automated according to the date and days input.

Is there any way I can define the external table to load partitions according to a date range, or date arithmetic?

I have a very similar issue where, after a migration, I have to recreate a table for which I have the data, but not the metadata. The solution seems to be, after recreating the table:

MSCK REPAIR TABLE table_name;

This is explained here.

It also mentions the "alter table X recover partitions" command that the OP mentioned in a comment on his own post. MSCK REPAIR TABLE table_name; works on non-Amazon-EMR implementations (Cloudera, in my case).
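
For the OP's layout, the recovery flow would look something like this (a sketch; the CREATE statement is taken from the question, and SHOW PARTITIONS is only there to verify the result):

CREATE EXTERNAL TABLE table_name (i INT)
PARTITIONED BY (d STRING)
LOCATION '${hiveconf:basepath}/TableName';

-- Scan the table's location and register any d=... directories as partitions.
MSCK REPAIR TABLE table_name;

-- Verify that the partitions were picked up.
SHOW PARTITIONS table_name;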

I do not believe there is any built-in functionality for this in Hive. You may be able to write a plugin; see Creating custom UDFs.

This probably doesn't need mentioning, but have you considered a simple bash script that takes your parameters and pipes the commands to hive?
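
Something like this minimal sketch, which assumes GNU date and the hive CLI's -e flag (the table and column names are taken from the question; the script name is hypothetical):

#!/usr/bin/env bash
# Usage: ./add_partitions.sh 2011-08-03 7
start_date="$1"
days="$2"

hql=""
for ((i = 0; i < days; i++)); do
  # GNU date syntax; on BSD/macOS you would use something like: date -v+${i}d
  d=$(date -d "$start_date + $i days" +%Y-%m-%d)
  hql+="ALTER TABLE table_name ADD PARTITION (d='$d'); "
done

# Run all the generated statements in a single hive session.
hive -e "$hql"

Running ./add_partitions.sh 2011-08-03 7 would add the seven partitions listed in the question.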

Oozie workflows would be another option; however, that might be overkill. See the Oozie Hive Extension. After some thinking, I don't think Oozie would work for this.

Partitions are a physical segmenting of the data, where each partition is maintained by the directory system, and queries use the metadata to determine where a partition is located. So if you can make the directory structure match the query, it should find the data you want. For example:

select count(*) from table_name where (d >= '2011-08-03') and (d <= '2011-08-09');

But I do not know of any date-range operations beyond that; you'll have to do the math to create the query pattern first.

You can also create external tables and add partitions to them that define the location. This allows you to shred the data as you like and still use the partition scheme to optimize the queries.
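
For instance, something like this (the target path is hypothetical; ADD PARTITION ... LOCATION is standard HiveQL):

-- Point a specific partition at an arbitrary directory instead of the default layout.
ALTER TABLE table_name ADD PARTITION (d='2011-08-03')
LOCATION '${hiveconf:basepath}/SomeOtherPlace/2011-08-03';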

I have explained a similar scenario in my blog post:

1) You need to set properties:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

2) Create an external staging table and load the input files' data into it.

3) Create a main production external table "production_order" with the date field as one of the partitioned columns.

4) Load the production table from the staging table so that the data is distributed into the partitions automatically (a sketch of steps 2-4 follows this list).
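
A minimal sketch, assuming a hypothetical staging table named staging_order over tab-delimited input (production_order is the name used in step 3):

-- Step 2: external staging table over the raw input files.
CREATE EXTERNAL TABLE staging_order (i INT, order_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${hiveconf:basepath}/staging';

-- Step 3: production table with the date field as a partition column.
CREATE EXTERNAL TABLE production_order (i INT)
PARTITIONED BY (d STRING)
LOCATION '${hiveconf:basepath}/production_order';

-- Step 4: dynamic-partition insert; Hive routes each row into its d=... partition.
-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE production_order PARTITION (d)
SELECT i, order_date FROM staging_order;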

A similar concept is explained in the blog post below, if you want to see the code:

http://exploredatascience.blogspot.in/2014/06/dynamic-partitioning-with-hive.html
