
Load the latest partition from a Hive table in Pig

I want to know how to load the latest partition of a Hive table in a Pig script. Obviously, I could load the whole table and then use the FILTER command to keep only the relevant partition.

However, if we don't know the latest date partition of the Hive table, how can we find that date and load only the partition for it?

As far as I know, this can't be done directly in Pig, but here is one way to do it with a shell script. It assumes your partition column is in datehour format or otherwise increases numerically.

# Store the latest partition value in a temp file, datehour.txt.
hive -e 'select max(datehour) from tweets1' > datehour.txt

# Read the value back from that same file.
datehour=$(cat datehour.txt)

# Describe that partition and store the output in a second temp file.
hive -e "describe formatted tweets1 partition (datehour=$datehour)" > partitionloc.txt

# Pull the partition's location out of the 'Location:' row.
partitionLocation=$(awk '/Location:/ { print $2 }' partitionloc.txt)

# Pass the location to the Pig script as a parameter to load the data from.
pig -param location="$partitionLocation" -f pigfile.pig
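The same steps can probably be chained without temp files. This is only a sketch: the function names `partition_location` and `load_latest_partition` are made up for illustration, the table, column and script names are the ones used above, and it assumes the datehour value is numeric so it needs no quoting in the partition spec.

```shell
#!/usr/bin/env bash

# Print the value of the "Location:" row from `describe formatted`
# output read on stdin (hypothetical helper name).
partition_location() {
    awk '$1 == "Location:" { print $2; exit }'
}

# Look up the latest partition and hand its location to Pig
# (hypothetical helper name; tweets1 and pigfile.pig come from the answer).
load_latest_partition() {
    local datehour location

    # -S (silent mode) keeps Hive's progress messages out of the result.
    datehour=$(hive -S -e 'select max(datehour) from tweets1') || return 1
    [ -n "$datehour" ] || { echo "no partition value found" >&2; return 1; }

    # Assumption: datehour is numeric, so it is not quoted here.
    location=$(hive -S -e "describe formatted tweets1 partition (datehour=${datehour})" \
        | partition_location)
    [ -n "$location" ] || { echo "no partition location found" >&2; return 1; }

    pig -param location="$location" -f pigfile.pig
}
```

After sourcing the file, calling `load_latest_partition` runs all three steps. Inside pigfile.pig the value is then available as `$location`. The empty-value checks are there so Pig is not started with a blank path when the table has no partitions.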

Let me know if it doesn't work.
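A possible variation, not part of the steps above: the latest value can also be picked from the output of `show partitions`, which should only need the metastore instead of a query over the data. The sketch below assumes a single partition column named datehour with numeric values, so every output line looks like `datehour=<value>`; `latest_partition_value` is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Read `show partitions` output on stdin and print the largest datehour value.
# Assumes one `datehour=<value>` line per partition and numeric values.
latest_partition_value() {
    sed -n 's/^datehour=//p' | sort -n | tail -n 1
}
```

It would be used as `datehour=$(hive -S -e 'show partitions tweets1' | latest_partition_value)`. This relies on the same ordering assumption as before: the numeric sort only finds the latest partition if the values increase over time.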

