
Hive Create Table Partitions from file name

New to Hadoop. I know how to create a table in Hive (the syntax), including creating a table with 3 partition keys, but here the keys are in the file names.

File name example: ServerName_ApplicationName_ApplicationName.XXXX.log.YYYY-MM-DD

There are hundreds of files in a directory. I want to create a table with the following partition keys derived from the file name: ServerName, ApplicationName, Date, and load all the files into the table. A Hive script would be the preference, but I am open to other ideas.

(The files are CSV, and I know the schema (column definitions) of the files.)

I assume the file name is in the format ServerName_ApplicationName.XXXX.log.YYYY-MM-DD (I removed the second "ApplicationName", assuming it to be a typo).
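Under that assumption, the mapping from file name to partition keys can be sketched in Python (the sample file name below is a hypothetical example, not from the question):

```python
# Illustrative sketch: derive the three partition keys from a log file name
# of the form ServerName_ApplicationName.XXXX.log.YYYY-MM-DD.
filename = "web01_billing.0001.log.2020-05-17"  # hypothetical example

# Split on the first "_" to separate the server name from the rest.
server_name, rest = filename.split("_", 1)  # "web01", "billing.0001.log.2020-05-17"

# The rest is dot-separated: application name first, date last.
parts = rest.split(".")                     # ["billing", "0001", "log", "2020-05-17"]
application_name = parts[0]
load_date = parts[-1]

print(server_name, application_name, load_date)  # web01 billing 2020-05-17
```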

Create a table on the contents of the original files, something like:

create external table default.stack
(col1 string,
 col2 string,
 col3 string,
 col4 int,
 col5 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://nameservice1/location1...';

Create another, partitioned table in a different location, for example:

create external table default.stack_part
(col1 string,
 col2 string,
 col3 string,
 col4 int,
 col5 int
)
PARTITIONED BY (servername string, applicationname string, load_date string)
STORED AS AVRO  -- you can choose any format for the final table
LOCATION 'hdfs://nameservice1/location2...';

Insert into the partitioned table from the base table using the query below:

set hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.output=true;
set hive.exec.parallel=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Insert overwrite table default.stack_part
partition (servername, applicationname, load_date)
select s.*,
       split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[0] as servername,
       split(split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[1],'[.]')[0] as applicationname,
       split(split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[1],'[.]')[3] as load_date
from default.stack s;
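The nested reverse/split calls just strip the directory part of Hive's INPUT__FILE__NAME virtual column and then cut the remaining file name apart on "_" and ".". The same logic, replicated in Python on a hypothetical path (the HDFS path and file name below are illustrative):

```python
# Replicates the Hive expressions on a hypothetical INPUT__FILE__NAME value.
input_file_name = "hdfs://nameservice1/location1/web01_billing.0001.log.2020-05-17"

# split(reverse(split(reverse(INPUT__FILE__NAME), "/")[0]), "_")[0]:
# reversing, splitting on "/", and reversing back strips the directory part.
filename = input_file_name[::-1].split("/")[0][::-1]  # "web01_billing.0001.log.2020-05-17"
servername = filename.split("_")[0]                   # "web01"

# split(split(..., "_")[1], '[.]')[0] and ...[3]:
# everything after "_" is dot-separated; take the first and fourth fields.
tail = filename.split("_")[1]                         # "billing.0001.log.2020-05-17"
applicationname = tail.split(".")[0]                  # "billing"
load_date = tail.split(".")[3]                        # "2020-05-17"

print(servername, applicationname, load_date)  # web01 billing 2020-05-17
```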

I have tested this and it works.
