
Partition Hive table by existing field?

Can I partition a Hive table upon insert by an existing field?

I have a 10 GB file with a date field and an hour-of-day field. Can I load this file into a table, then insert-overwrite into another, partitioned table that uses those fields as its partitions? Would something like the following work?

INSERT OVERWRITE TABLE tealeaf_event  PARTITION(dt=evt.datestring,hour=evt.hour) 
SELECT * FROM staging_event evt;

Thanks!

Travis

I just ran across this while trying to answer the same question; it was helpful but not quite complete. The short answer is yes: something like the query in the question will work, but the syntax is not quite right.

Say you have three tables created with the following statements:

CREATE TABLE staging_unpartitioned (datestring string, hour int, a int, b int);

CREATE TABLE staging_partitioned (a int, b int) 
    PARTITIONED BY (datestring string, hour int);

CREATE TABLE production_partitioned (a int, b int) 
    PARTITIONED BY (dt string, hour int);

Columns a and b are just example columns. dt and hour are the values we want to partition on once the data reaches the production table. Moving the staging data to production looks exactly the same whether it comes from staging_unpartitioned or from staging_partitioned:

INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_unpartitioned;

INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_partitioned;

This uses a feature called dynamic partitioning, which is described in the Hive documentation. The important thing to note is that the mapping between columns and partitions is determined by position in the SELECT list, not by name. The dynamic partition columns must be selected last, in the same order as they appear in the PARTITION clause.

There's a good chance that the code above will fail when you first run it because of your current property settings. First, it will not work if dynamic partitioning is disabled, so make sure to:
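To make the positional mapping concrete, here is a small sketch that reuses the example tables above. It assumes dynamic partitioning is already enabled in nonstrict mode for the session; the commented-out statement is a deliberate mistake shown only for contrast.

```sql
-- Assumes the three example tables above exist and that the session has
-- hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict.

-- Correct: partition columns come last, in PARTITION-clause order (dt, then hour).
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_unpartitioned;

-- Mistake (left commented out): hour and datestring are swapped.
-- Hive matches by position, not by name, so the hour values would be
-- treated as dt and the date strings as hour.
-- INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
--     SELECT a, b, hour, datestring FROM staging_unpartitioned;

-- List the partitions that were created to check the mapping.
SHOW PARTITIONS production_partitioned;
```

Running SHOW PARTITIONS after the insert is a quick way to confirm that the dt and hour values landed where you expected.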

set hive.exec.dynamic.partition=true;

Then you might hit an error if you do not specify at least one static partition before the dynamic partitions. This restriction is meant to save you from accidentally removing a root partition when you only meant to overwrite its sub-partitions with dynamic partitions. In my experience this behavior has never been helpful and has often been annoying, but your mileage may vary. At any rate, it is easy to change:

set hive.exec.dynamic.partition.mode=nonstrict;

And that should do it.
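If you would rather keep strict mode, one alternative is to make the leading partition static and leave only the trailing one dynamic. The following is a rough sketch against the example tables above; the date literal '2013-01-15' is made up for illustration.

```sql
-- Enable dynamic partitioning but stay in strict mode.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=strict;

-- dt is static (a literal) and hour is dynamic, which satisfies strict mode.
-- Only the dynamic partition column (hour) is selected after the data columns.
-- '2013-01-15' is a hypothetical example date.
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt='2013-01-15', hour)
    SELECT a, b, hour
    FROM staging_unpartitioned
    WHERE datestring = '2013-01-15';
```

The trade-off is that you need one statement per dt value, so this suits day-by-day loads better than a one-off bulk load covering many dates.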

Maybe this is already answered, but yes, you can do exactly as you have stated. I have done it many times. Obviously your new table needs to be defined like the original one, but without the partition column among its regular columns and with a partition specification instead. Also, I cannot remember whether I had to explicitly list the columns of the original table or whether the asterisk was sufficient.

I'm not entirely sure about this, but something like the following might work:
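On the asterisk question, whether SELECT * is enough depends on the column order of the source table, since dynamic partition columns are matched by position. A sketch using the example tables from the earlier answer (and assuming dynamic partitioning is enabled in nonstrict mode):

```sql
-- staging_partitioned is declared as (a, b) PARTITIONED BY (datestring, hour).
-- Hive lists partition columns after the regular columns, so SELECT * yields
-- a, b, datestring, hour, which is the order the target table expects.
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT * FROM staging_partitioned;

-- staging_unpartitioned is declared as (datestring, hour, a, b). SELECT * would
-- put datestring and hour first, so list the columns explicitly instead.
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_unpartitioned;
```

Listing the columns explicitly is the safer habit, because it keeps working if someone later changes the column order of the staging table.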

INSERT OVERWRITE TABLE tealeaf_event
SELECT col1 as tealeaf_col1, ..., datestring as ds
FROM staging_event;

No. You will have to drop that field or, at least, rename it.
