Hive Create Table Partitions from file name
I am new to Hadoop. I know the syntax for creating a table in Hive. I want to create a table with 3 partition keys, but the keys are in the file names.
File name example: ServerName_ApplicationName_ApplicationName.XXXX.log.YYYY-MM-DD
There are hundreds of files in a directory. I want to create a table with the following partition keys taken from the file name: ServerName, ApplicationName, Date, and then load all the files into the table. A Hive script would be preferred, but I am open to any other ideas.
(The files are CSV, and I know the schema (column definitions) of the files.)
I assume the file name is in the format ServerName_ApplicationName.XXXX.log.YYYY-MM-DD (I removed the second "ApplicationName", assuming it was a typo).
Create a table on the contents of the original files. Something like:
create external table default.stack
(col1 string,
col2 string,
col3 string,
col4 int,
col5 int
)
ROW FORMAT DELIMITED
FIELDS terminated by ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://nameservice1/location1...';
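Before going further, it may be worth confirming that Hive can read the files and that their paths are visible. A minimal check, assuming the base table was created as default.stack as above, is to list the distinct values of the INPUT__FILE__NAME virtual column, which the later insert relies on:

```sql
-- List the distinct source files behind the base table.
-- INPUT__FILE__NAME is a Hive virtual column that holds the full path
-- of the file each row was read from.
SELECT DISTINCT INPUT__FILE__NAME
FROM default.stack;
```

Each returned path should end with a name of the form ServerName_ApplicationName.XXXX.log.YYYY-MM-DD; if it does not, the split expressions used later will need adjusting.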
Create another partitioned table in a different location, like:
create external table default.stack_part
(col1 string,
col2 string,
col3 string,
col4 int,
col5 int
)
PARTITIONED BY ( servername string, applicationname string, load_date string)
STORED AS AVRO -- you can choose any format for the final files
location 'hdfs://nameservice1/location2...';
Insert into the partitioned table from the base table using the query below:
set hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.output=true;
set hive.exec.parallel=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Insert overwrite table default.stack_part
partition ( servername, applicationname, load_date)
select *,
split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[0] as servername
,split(split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[1],'[.]')[0] as applicationname
,split(split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[1],'[.]')[3] as load_date
from default.stack;
I have tested this and it works.
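To check the result on your own data, something like the following sketch could be run after the insert. It assumes the partitioned table is default.stack_part as above; the row_count alias is only illustrative.

```sql
-- List the partitions created by the dynamic-partition insert.
SHOW PARTITIONS default.stack_part;

-- Count rows per derived key, to compare against what the file names suggest.
SELECT servername, applicationname, load_date, count(*) AS row_count
FROM default.stack_part
GROUP BY servername, applicationname, load_date;
```

If a server name itself contains an underscore, the values will not line up with the file names, because the insert splits on the first "_".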