Hive: creating table partitions from file names

Question

I am new to Hadoop. I know the syntax for creating a table in Hive, and I want to create a table with three partition keys, but the key values are embedded in the file names.

File name example: ServerName_ApplicationName_ApplicationName.XXXX.log.YYYY-MM-DD

There are hundreds of files in a directory. I want to create a table with the following partition keys taken from the file name: ServerName, ApplicationName, and Date, and then load all the files into the table. A Hive script would be preferred, but I am open to other ideas.

(The files are CSV, and I know the schema, i.e. the column definitions, of the files.)

Answer 1

I assume the file name has the format ServerName_ApplicationName.XXXX.log.YYYY-MM-DD (I removed the second "ApplicationName", assuming it was a typo).

First, create an external table over the original files, something like:

-- Base table: reads the raw CSV files in place.
create external table default.stack
(col1 string,
 col2 string,
 col3 string,
 col4 int,
 col5 int
)
row format delimited
fields terminated by ','
stored as inputformat
  'org.apache.hadoop.mapred.TextInputFormat'
outputformat
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://nameservice1/location1...';
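
As a quick sanity check (a sketch, not part of the original answer), the base table can be queried for the INPUT__FILE__NAME virtual column. This confirms that Hive sees the files and shows the exact path format that any file-name parsing has to handle:

```sql
-- List a few of the distinct source files behind the base table.
-- INPUT__FILE__NAME is a Hive virtual column holding the full path
-- of the file each row was read from.
SELECT DISTINCT INPUT__FILE__NAME
FROM default.stack
LIMIT 10;
```

Each returned value should be a full HDFS path ending in a name of the form ServerName_ApplicationName.XXXX.log.YYYY-MM-DD.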

Then create a second, partitioned table in another location, for example:

-- Target table: same columns, partitioned by the values taken from the file name.
create external table default.stack_part
(col1 string,
 col2 string,
 col3 string,
 col4 int,
 col5 int
)
partitioned by (servername string, applicationname string, load_date string)
stored as avro  -- any storage format can be used for the final table
location 'hdfs://nameservice1/location2...';

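Finally, insert into the partitioned table from the base table using the query below. The virtual column INPUT__FILE__NAME holds the full path of the file each row came from; the expressions strip the directory part and then split the remaining file name on "_" and ".".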

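-- Allow all partition columns to be assigned dynamically.
set hive.exec.dynamic.partition=true;  -- only needed where this is not already the default
set hive.exec.dynamic.partition.mode=nonstrict;
-- Optional: compress the output with Snappy and run independent stages in parallel.
set hive.exec.compress.output=true;
set hive.exec.parallel=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- reverse(split(reverse(path), '/')[0]) yields the file name without its directory.
-- The dynamic partition columns must come last in the select list, in partition order.
insert overwrite table default.stack_part
partition (servername, applicationname, load_date)
select *,
       -- text before the first "_"
       split(reverse(split(reverse(INPUT__FILE__NAME), '/')[0]), '_')[0] as servername,
       -- first "."-separated token after the first "_"
       split(split(reverse(split(reverse(INPUT__FILE__NAME), '/')[0]), '_')[1], '[.]')[0] as applicationname,
       -- fourth "."-separated token after the first "_" (the YYYY-MM-DD suffix)
       split(split(reverse(split(reverse(INPUT__FILE__NAME), '/')[0]), '_')[1], '[.]')[3] as load_date
from default.stack;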

I have tested this and it works.
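
Before running the insert, it can help to preview the partition values that would be derived from each file. The query below is a sketch of such a check and was not part of the tested answer. It uses regexp_extract as an alternative to the nested split/reverse calls, and its regular expressions assume the file-name layout described above (ServerName_ApplicationName.XXXX.log.YYYY-MM-DD), so they should be adjusted if the real names differ:

```sql
-- Preview the partition values derived from each source file.
-- Assumption: the file name ends in a YYYY-MM-DD date and contains
-- at least one "_" separating the server name from the application name.
SELECT DISTINCT
       regexp_extract(INPUT__FILE__NAME, '([^/]+)$', 1)                       AS file_name,
       -- text before the first "_" in the file name
       regexp_extract(INPUT__FILE__NAME, '([^/_]+)_[^/]*$', 1)                AS servername,
       -- text between the first "_" and the next "_" or "."
       regexp_extract(INPUT__FILE__NAME, '[^/_]+_([^/_.]+)[^/]*$', 1)         AS applicationname,
       -- trailing date suffix
       regexp_extract(INPUT__FILE__NAME, '([0-9]{4}-[0-9]{2}-[0-9]{2})$', 1)  AS load_date
FROM default.stack;
```

Rows where any of the three derived columns comes back empty point to files that do not follow the expected naming pattern. After the load, SHOW PARTITIONS default.stack_part; lists the partitions that were created.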

Answer 1 (3 votes, accepted 2015-12-02 19:26:02)
