简体   繁体   中英

Hive - partition external table by content of data

I have a bunch of gzipped files in HDFS under directories of the form /home/myuser/salesdata/some_date/ALL/<country>.gz , for instance /home/myuser/salesdata/20180925/ALL/us.gz

The data is of the form

<country> \t count1,count2,count3

So essentially it's first tab separated and then I need to extract the comma separated values into separate columns

I'd like to create an external table, partitioning this by country, year, month and day. The size of the data is pretty huge, potentially 100s of TB and so I'd like to have an external table itself, rather than having to duplicate the data by importing it into a standard table.

Is it possible to achieve this by using only an external table?

considering your country is separated by tab '\\t' and other fields separated by , this is what you can do.

You can create a temporary table which has first columns as string and rest as array.

create external table temp.test_csv (country string, count array<int>)
row format delimited
fields terminated by "\t"
collection items terminated by ',' 
stored as textfile
location '/apps/temp/table';

Now if you drop your files into the /apps/temp/table location you should be able to to select the data as mentioned below.

select country, count[0] as count_1, count[1] count_2, count[2] count_3 from temp.test_csv

Now to create partitions create another table, as mentioned below.

drop table temp.test_csv_orc;
create table temp.test_csv_orc ( count_1 int, count_2 int, count_3 int) 
partitioned by(year string, month string, day string, country string) 
stored as orc; 

And load the data from temporary table into this one.

insert into temp.test_csv_orc partition(year="2018", month="09", day="28", country) 
select count[0] as count_1, count[1] count_2, count[2] count_3, country from temp.test_csv

I have taken country as Dynamic Parition as it's coming from file however others aren't so it's static.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM