简体   繁体   中英

Hive Table partition with column in the middle

I want to create a table with partitions from existing files on Hadoop. The datevalue on which I need to partition is available in the files, but the datevalue column position is not last. It is in the middle. How can I create the table for the same?

Here is the sample:

1  John    2012-01-10 Miller  
2  Austin  2012-02-22 Powers

Let's take an example: you want to have a partitioned hive table with three columns (id INT, fname STRING, dt STRING, lname STRING) where id , fname , lname are columns that store an integer id, string first name and a string last name respecitvely and dt is a partition column of type string that contains date in yyyy-MM-dd format. To create a table like this you would issue a command like:

CREATE EXTERNAL TABLE my_table (id INT, fname STRING, lname STRING)
PARTITIONED BY (dt STRING)
LOCATION '/usr/hive/warehouse/my_table';

When you insert data into this table (via INSERT OVERWRITE command, say) and go check the HDFS location (/usr/hive/warehouse/my_table), you would find that the data is stored in directories; one directory per partition. The name of the directory would be something like dt=2012-01-01 or dt=2012-02-22 . Inside these directories would be your actual data in whatever format you had selected it to be stored in. The partition column is not stored with this data; it is a virtual column that is deciphered from the partition directory your data is present in.

Now let's get to your question. Since partitioning column is a virtual column, you can not put a partitioned Hive table on top of your data as it is (regardless of whether your to-be-partitioning column is present in the middle of the file or at the end). You need the appropriate directory structure to be present in HDFS for partitioning to work. You would want to create a staging table that is not partitioned.

CREATE EXTERNAL TABLE my_table_staging (id INT, fname STRING, dt STRING, lname STRING)
LOCATION '/usr/hive/warehouse/my_table_staging';

Then use this staging table as a source to populate your partitioned table using Dynamic partitioning. You can use a command like the one below for this:

INSERT OVERWRITE TABLE my_table PARTITION (dt)
SELECT id, fname, lname, dt FROM my_table_staging;

This command will read the data from your staging table and insert it into the partitioned table, creating the appropriate directory structure for you on the HDFS.

References: - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL - https://cwiki.apache.org/Hive/dynamicpartitions.html

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM