
Writing sub directories in Hive partitions

Problem statement

I have files with the schema Event_Time, AD_id, as follows:

file_20170102 - may have records with event_time 20170101, 20170102, 20170103
file_20170103 - may have records with event_time 20170102, 20170103, 20170104

Here event_time is the time when the event occurred, and the timestamp in the file name is when the events were collected. So the timestamp in the file name and the event_time values inside the file are out of sync.

When I write this data to Hive, I need to partition it by event_time, because the users are interested in queries based on event_time.

So my output looks as follows:

/path/to/output/event_time=20170102/....parquet
/path/to/output/event_time=20170103/....parquet

However, I also need to keep track of the file timestamp, because sometimes a file gets reposted and we then want to delete the data already processed from it, based on the file timestamp.

Is there a way I could write to /path/to/output/event_time=20170101/20170102(file_timestamp)?

Please note that 20170102 (file_timestamp) above is a plain directory and not a Hive partition.

Alternatively, can I control the name of the Parquet file, so that when I want to delete a file it is easy to figure out which files to delete?

Demo

Files under /home/dmarkovitz/myfiles

myfile_1_20161204.csv

20161204,1
20161203,2

myfile_2_20161205.csv

20161203,3
20161204,4
20161205,5
20161203,6

myfile_3_20161205.csv

20161205,7
20161205,8
20161203,9

hive

create external table myfiles
(
    Event_Time  string
   ,AD_id       int
)
row format delimited
fields terminated by ','
stored as textfile
location 'file:///home/dmarkovitz/myfiles'
;

select  * 
       ,input__file__name

from    myfiles 
;

+------------+-------+-----------------------------------------------------+
| event_time | ad_id |                  input__file__name                  |
+------------+-------+-----------------------------------------------------+
|   20161204 |     1 | file:/home/dmarkovitz/myfiles/myfile_1_20161204.csv |
|   20161203 |     2 | file:/home/dmarkovitz/myfiles/myfile_1_20161204.csv |
|   20161205 |     7 | file:/home/dmarkovitz/myfiles/myfile_3_20161205.csv |
|   20161205 |     8 | file:/home/dmarkovitz/myfiles/myfile_3_20161205.csv |
|   20161203 |     9 | file:/home/dmarkovitz/myfiles/myfile_3_20161205.csv |
|   20161203 |     3 | file:/home/dmarkovitz/myfiles/myfile_2_20161205.csv |
|   20161204 |     4 | file:/home/dmarkovitz/myfiles/myfile_2_20161205.csv |
|   20161205 |     5 | file:/home/dmarkovitz/myfiles/myfile_2_20161205.csv |
|   20161203 |     6 | file:/home/dmarkovitz/myfiles/myfile_2_20161205.csv |
+------------+-------+-----------------------------------------------------+

create table mytable
(
    AD_id   int
)
partitioned by (file_dt date,Event_Time date)
stored as parquet
;

set hive.exec.dynamic.partition.mode=nonstrict;

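-- Dynamic partition values are taken from the last two select columns, in order:
--   file_dt    = date token of the source file name (split on '_' or '.', element [2])
--   Event_Time = Event_Time reformatted from yyyyMMdd to yyyy-MM-dd
insert into mytable partition (file_dt,Event_Time)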

select  ad_id
       ,from_unixtime(unix_timestamp(split(input__file__name,'[_.]')[2],'yyyyMMdd'),'yyyy-MM-dd')
       ,from_unixtime(unix_timestamp(Event_Time,'yyyyMMdd'),'yyyy-MM-dd')

from    myfiles
;

show partitions mytable
;

+------------------------------------------+
|                partition                 |
+------------------------------------------+
| file_dt=2016-12-04/event_time=2016-12-03 |
| file_dt=2016-12-04/event_time=2016-12-04 |
| file_dt=2016-12-05/event_time=2016-12-03 |
| file_dt=2016-12-05/event_time=2016-12-04 |
| file_dt=2016-12-05/event_time=2016-12-05 |
+------------------------------------------+

select  *
       ,input__file__name 

from    mytable
;

+-------+------------+------------+----------------------------------------------------------------------+
| ad_id |  file_dt   | event_time |                          input__file__name                           |
+-------+------------+------------+----------------------------------------------------------------------+
|     2 | 2016-12-04 | 2016-12-03 | file:/mydb/mytable/file_dt=2016-12-04/event_time=2016-12-03/000000_0 |
|     1 | 2016-12-04 | 2016-12-04 | file:/mydb/mytable/file_dt=2016-12-04/event_time=2016-12-04/000000_0 |
|     9 | 2016-12-05 | 2016-12-03 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-03/000000_0 |
|     3 | 2016-12-05 | 2016-12-03 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-03/000000_0 |
|     6 | 2016-12-05 | 2016-12-03 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-03/000000_0 |
|     4 | 2016-12-05 | 2016-12-04 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-04/000000_0 |
|     7 | 2016-12-05 | 2016-12-05 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-05/000000_0 |
|     8 | 2016-12-05 | 2016-12-05 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-05/000000_0 |
|     5 | 2016-12-05 | 2016-12-05 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-05/000000_0 |
+-------+------------+------------+----------------------------------------------------------------------+

explain dependency
select  *       
from    mytable
where   Event_Time = date '2016-12-04'        
;

{"input_tables":[{"tablename":"local_db@mytable","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"local_db@mytable@file_dt=2016-12-04/event_time=2016-12-04"},{"partitionName":"local_db@mytable@file_dt=2016-12-05/event_time=2016-12-04"}]}


bash

tree mytable

mytable
├── file_dt=2016-12-04
│   ├── event_time=2016-12-03
│   │   └── 000000_0
│   └── event_time=2016-12-04
│       └── 000000_0
└── file_dt=2016-12-05
    ├── event_time=2016-12-03
    │   └── 000000_0
    ├── event_time=2016-12-04
    │   └── 000000_0
    └── event_time=2016-12-05
        └── 000000_0
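
With file_dt as the leading partition column, the data loaded from files collected on a given date can be removed with one partial partition spec, which is what the question needs when a file is reposted. The following is only a sketch under assumptions: the function name drop_stmt_for_file is hypothetical, the table name mytable and the file-name layout (date as the last '_'-separated token before the extension) are taken from the demo above, and the script only prints the statement rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: build the HiveQL statement that drops every partition loaded from
# files collected on the same date as the given file.
# Assumes the demo's naming scheme: <name>_<n>_<yyyyMMdd>.<ext>

drop_stmt_for_file() {
    local file_name base stamp file_dt
    file_name=$(basename "$1")   # ignore the directory part of the path
    base=${file_name%.*}         # strip the extension, e.g. myfile_2_20161205
    stamp=${base##*_}            # text after the last underscore, e.g. 20161205
    # Refuse anything that is not exactly eight digits.
    case "$stamp" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "unexpected file name: $1" >&2; return 1 ;;
    esac
    # Reformat yyyyMMdd as yyyy-MM-dd to match the file_dt partition values.
    file_dt="${stamp:0:4}-${stamp:4:2}-${stamp:6:2}"
    printf "alter table mytable drop if exists partition (file_dt='%s');\n" "$file_dt"
}

drop_stmt_for_file /home/dmarkovitz/myfiles/myfile_2_20161205.csv
```

The printed statement could then be passed to Hive, for example with hive -e. Note that file_dt identifies a collection date, not a single file: in the demo, myfile_2 and myfile_3 share file_dt=2016-12-05, so dropping that partition removes the rows of both, and both would have to be reloaded.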
