How to merge small files from existing partitions in Hive?

How can I merge the small files of several existing partitions into one large file in a single partition?

For example, I have a table user1 with the columns fname and lname, partitioned by the column day.

I created the table with the script below:

CREATE TABLE user1(fname string, lname string) PARTITIONED BY (day int);

After inserting data into the partitioned table, it looks like this:

 fname  lname  day
.....................
AA      AAA   20170201     ....>partition 20170201
BB      BBB   20170201
...................
CC      CCC   20170202    ......>partition 20170202
DD      DDD   20170202
....................
EE      EEE   20170203    .......>partition 20170203
FF      FFF   20170203
.......................
GG      GGG   20170204    ........>partition 20170204         
HH      HHH   20170204
.......................

When I run a select query filtered on the partition column, for example day=20170201:

select * from user1 where day=20170201;

It returns the following result:

AA      AAA   20170201
BB      BBB   20170201

Based on the table above, I want to merge all the small files of the partitions day=20170201, day=20170202 and day=20170203 into the single partition day=20170203 of my partitioned table (user1). It should look like this:

fname  lname  day
.....................
AA      AAA   20170201
BB      BBB   20170201
CC      CCC   20170202    
DD      DDD   20170202
EE      EEE   20170203    .......>partition 20170203
FF      FFF   20170203
.......................
GG      GGG   20170204    ........>partition 20170204         
HH      HHH   20170204
.......................

Can you please suggest how I can achieve this?

Thanks in advance.

  1. Create a new table partitioned by a new field, partition_day:
 CREATE TABLE user_new(fname string, lname string, day int) PARTITIONED BY (partition_day int);
  2. Load the data into the new table (define your conditions for the new partitions in the case expression):
  insert overwrite table user_new partition (partition_day) select fname, lname, day, case when day <= 20170203 then 20170203 when day > 20170203 then 20170204 end as partition_day from user1;
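
The insert in step 2 does not give partition_day a fixed value, so it relies on dynamic partitioning. Depending on the Hive version and cluster configuration, that may need to be enabled first. The sketch below shows session settings that are commonly set before such an insert, followed by two queries to check the result. The merge settings are an assumption about what helps here: they apply to the MapReduce engine and only ask Hive to combine small output files at the end of the job.

```sql
-- Assumption: dynamic partitioning is disabled or in strict mode on this cluster.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Assumption: the job runs on MapReduce; these ask Hive to merge small output files.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;

-- Run the "insert overwrite table user_new ..." statement from step 2 here.

-- List the partitions that the insert created.
SHOW PARTITIONS user_new;

-- Count the rows of each original day inside each new partition.
SELECT partition_day, day, count(*) AS row_count
FROM user_new
GROUP BY partition_day, day;
```

With the sample data from the question, the rows for day 20170201, 20170202 and 20170203 should all land in partition_day=20170203, and the rows for day 20170204 in partition_day=20170204.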
