Table Partitioned by Timestamp Field

To generate some summary figures, we periodically import data into Hive. We currently use a CSV file format with the following layout:

operation,item,timestamp,user,marketingkey

We currently have a few queries that group by the date (yyyy-MM-dd) derived from the timestamp field.
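
For reference, a sketch of the kind of grouping query we run today (rawEvents and its columns are placeholder names, not the actual table; ts holds epoch seconds):

SELECT
    from_unixtime(ts, 'yyyy-MM-dd') AS day,
    count(*) AS events
FROM
    rawEvents
GROUP BY
    from_unixtime(ts, 'yyyy-MM-dd');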

The imported files sometimes contain several days of data, and I would like to store them in a partitioned way. Is there a way to do this with Hive? I have built the table with the following DDL:

CREATE TABLE 
   partitionedTable (name string) 
PARTITIONED BY (time bigint) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

The data loading was done like:

LOAD DATA LOCAL INPATH 
   '/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv' 
INTO TABLE partitionedTable PARTITION(time='2013-05-01');

But I would like Hive to apply the partitioning automatically, based on a field that comes within the imported file. For example:

login,1,1370793184,user1,none --> stored to partition 2013-06-09
login,2,1360793184,user1,none --> stored to partition 2013-02-13
login,1,1360571184,user2,none --> stored to partition 2013-02-11
buy,2,1360501184,user2,key1   --> stored to partition 2013-02-10
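
These epoch-second values can be double-checked with Hive's from_unixtime UDF. A quick sketch (on Hive versions that allow SELECT without a FROM clause; note that from_unixtime formats in the server's default timezone, so the resulting day can shift depending on where the cluster runs):

SELECT from_unixtime(1370793184, 'yyyy-MM-dd'); -- expected: 2013-06-09
SELECT from_unixtime(1360793184, 'yyyy-MM-dd'); -- expected: 2013-02-13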

It seems like you are looking for dynamic partitioning, and Hive supports dynamic partition inserts as detailed in this article.

First, you need to create a temporary table where you will put your flat data with no partition at all. In your case this would be:

CREATE TABLE 
    flatTable (type string, id int, ts bigint, user string, key string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Then, you should load your flat data file into this table:

LOAD DATA LOCAL INPATH
    '/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv'
INTO TABLE flatTable;
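
Optionally, a quick sanity check that the rows landed in the flat table as expected (just a sketch):

SELECT * FROM flatTable LIMIT 5;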

At that point you can use a dynamic partition insert. A few things to keep in mind: you'll need the following properties set:

  • hive.exec.dynamic.partition should be set to true, because dynamic partitioning is disabled by default (I believe).
  • hive.exec.dynamic.partition.mode should be set to nonstrict, because your only partition column is dynamic, and strict mode requires at least one static partition.

So you can run the following query:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM
    flatTable
INSERT OVERWRITE TABLE
    partitionedTable
PARTITION(time)
SELECT
    user, from_unixtime(ts, 'yyyy-MM-dd') AS time;

Hive takes the dynamic partition value from the last column of the SELECT clause, which is why from_unixtime(ts, 'yyyy-MM-dd') comes last and is aliased to time. This should spawn 2 MapReduce jobs, and at the end you should see something along the lines of:

Loading data to table default.partitionedtable partition (time=null)
    Loading partition {time=2013-02-10}
    Loading partition {time=2013-02-11}
    Loading partition {time=2013-02-13}
    Loading partition {time=2013-06-09}

And to verify that your partitions are indeed there:

$ hadoop fs -ls /user/hive/warehouse/partitionedTable/
Found 4 items
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-10
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-11
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-13
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-06-09
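
You can also ask Hive itself to list the partitions, as an alternative to listing the warehouse directory:

SHOW PARTITIONS partitionedTable;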

Please note that dynamic partitions are only supported since Hive 0.6, so if you have an older version this is probably not going to work.
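
Once the data is laid out this way, the grouping queries from the question can group on the time partition column directly instead of deriving the date per row; a minimal sketch against the tables above:

SELECT
    time, count(*) AS events
FROM
    partitionedTable
GROUP BY
    time;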
