
Hive: dynamic partition adding to external table

I am running Hive 0.7.1, processing existing data which has the following directory layout:

- TableName
  - d= (e.g. 2011-08-01)
  - d=2011-08-02
  - d=2011-08-03
  - ... etc.

Under each date directory I have the data files. Now, to load the data I'm using:

CREATE EXTERNAL TABLE table_name (i INT)
PARTITIONED BY (d STRING)
LOCATION '${hiveconf:basepath}/TableName';

I would like my Hive script to be able to load the relevant partitions according to some input date and number of days. So if I pass date='2011-08-03' and days='7', the script should load the following partitions:

- d=2011-08-03
- d=2011-08-04
- d=2011-08-05
- d=2011-08-06
- d=2011-08-07
- d=2011-08-08
- d=2011-08-09

I haven't found any decent way to do it except explicitly running:

ALTER TABLE table_name ADD PARTITION (d='2011-08-03');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-04');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-05');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-06');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-07');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-08');  
ALTER TABLE table_name ADD PARTITION (d='2011-08-09');  

and then running my query:

select count(1) from table_name;

However, this is of course not automated according to the date and days input.

Is there any way I can define the external table to load partitions according to a date range, or date arithmetic?

I have a very similar issue where, after a migration, I have to recreate a table for which I have the data, but not the metadata. The solution seems to be, after recreating the table:

MSCK REPAIR TABLE table_name;

Explained here.

This also mentions the "alter table X recover partitions" that the OP commented on his own post. MSCK REPAIR TABLE table_name; works on non-Amazon-EMR implementations (Cloudera, in my case).
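For reference, a minimal sketch of both variants, assuming the table from the question (ALTER TABLE ... RECOVER PARTITIONS being the Amazon EMR equivalent):

-- Standard Hive: scans the table's location and adds any
-- partition directories missing from the metastore.
MSCK REPAIR TABLE table_name;

-- Amazon EMR Hive equivalent of the above.
ALTER TABLE table_name RECOVER PARTITIONS;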

I do not believe there is any built-in functionality for this in Hive. You may be able to write a plugin; see Creating custom UDFs.

Probably no need to mention this, but have you considered a simple bash script that would take your parameters and pipe the commands to hive?
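A minimal sketch of such a script, assuming GNU date and the table and partition names from the question:

#!/usr/bin/env bash
# Usage: ./add_partitions.sh 2011-08-03 7
# Builds one ALTER TABLE ... ADD PARTITION statement per day in the
# range and runs the whole batch through the hive CLI in one call.
start_date="$1"   # first partition date, e.g. 2011-08-03
days="$2"         # number of consecutive days to add

sql=""
for ((i = 0; i < days; i++)); do
  # GNU date arithmetic; on BSD/macOS use date's -v flag instead
  d=$(date -d "$start_date + $i days" +%Y-%m-%d)
  sql+="ALTER TABLE table_name ADD PARTITION (d='$d'); "
done

hive -e "$sql"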

Oozie workflows would be another option, however that might be overkill: Oozie Hive Extension. After some thinking, I don't think Oozie would work for this.

The partitions are a physical segmenting of the data: each partition is maintained by the directory system, and queries use the metadata to determine where the partition is located. So if you can make the directory structure match the query, it should find the data you want. For example:

select count(*) from table_name where (d >= '2011-08-03') and (d <= '2011-08-09');

But I do not know of any date-range operations otherwise; you'll have to do the math to create the query pattern first.

You can also create external tables, and add partitions to them that define the location. This allows you to shred the data as you like, and still use the partition scheme to optimize the queries.
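For example, a partition can be pointed at an arbitrary directory (the LOCATION path here is illustrative):

-- Each partition can live anywhere; the metastore records its location,
-- so the physical layout need not follow the default d=... scheme.
ALTER TABLE table_name ADD PARTITION (d='2011-08-03')
LOCATION '${hiveconf:basepath}/TableName/d=2011-08-03';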

I have explained a similar scenario in my blog post:

1) You need to set these properties:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

2) Create an external staging table and load the input files' data into it.

3) Create a main production external table "production_order" with a date field as one of the partition columns.

4) Load the production table from the staging table so that the data is distributed into partitions automatically, as shown in the sketch below.
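A minimal sketch of steps 2-4, with assumed schemas and paths (order_id, order_date, and the LOCATION directories are all illustrative):

-- 2) Staging table over the raw input files
CREATE EXTERNAL TABLE staging_order (order_id INT, order_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging/orders';

-- 3) Production table partitioned by the date column
CREATE EXTERNAL TABLE production_order (order_id INT)
PARTITIONED BY (order_date STRING)
LOCATION '/data/production/orders';

-- 4) Dynamic-partition insert: Hive routes each row to its
--    partition based on the last column in the SELECT list.
INSERT OVERWRITE TABLE production_order PARTITION (order_date)
SELECT order_id, order_date FROM staging_order;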

The same concept, with the full code, is explained in the blog post below:

http://exploredatascience.blogspot.in/2014/06/dynamic-partitioning-with-hive.html
