
Hive partitioning for data on S3

Our data is stored using s3://bucket/YYYY/MM/DD/HH and we are using AWS Firehose to land Parquet data in those locations in near real time. I can query the data using AWS Athena just fine, but we have a Hive query cluster that has trouble querying the data when partitioning is enabled.

This is what I am doing: PARTITIONED BY ( `year` string, `month` string, `day` string, `hour` string)

This doesn't seem to work when the data on S3 is stored as s3://bucket/YYYY/MM/DD/HH,

however it does work for s3://bucket/year=YYYY/month=MM/day=DD/hour=HH.

Given the fixed bucket paths produced by Firehose, I cannot modify the S3 paths. So my question is: what is the right partitioning scheme in the Hive DDL when the data path does not contain explicitly named columns like year= or month=?

If you cannot get folder names that follow the Hive naming convention, you need to map each partition manually:

ALTER TABLE tableName ADD PARTITION (year='YYYY') LOCATION 's3://bucket/YYYY'
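Since every hour of data needs its own ADD PARTITION statement, it is common to generate them with a small script. A minimal sketch (the table name `events` and bucket name `bucket` are placeholders, not from the original post):

```python
from datetime import datetime, timedelta

def partition_ddl(table, bucket, start, end):
    """Generate one ALTER TABLE ... ADD PARTITION statement per hour
    for paths laid out as s3://bucket/YYYY/MM/DD/HH."""
    stmts = []
    t = start
    while t <= end:
        stmts.append(
            "ALTER TABLE {tbl} ADD IF NOT EXISTS PARTITION "
            "(year='{y}', month='{m}', day='{d}', hour='{h}') "
            "LOCATION 's3://{b}/{y}/{m}/{d}/{h}';".format(
                tbl=table, b=bucket,
                y=t.strftime('%Y'), m=t.strftime('%m'),
                d=t.strftime('%d'), h=t.strftime('%H')))
        t += timedelta(hours=1)
    return stmts

# Example: register three hourly partitions.
stmts = partition_ddl('events', 'bucket',
                      datetime(2020, 1, 1, 0), datetime(2020, 1, 1, 2))
for s in stmts:
    print(s)
```

The generated statements can then be run in Hive (or via the Glue/Athena APIs); IF NOT EXISTS makes the script safe to re-run.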

Alternatively, you can now specify a custom S3 prefix in Firehose: https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html

myPrefix/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
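With that prefix, a record arriving at, say, 2021-03-05 14:30 UTC lands under a Hive-style key. A quick sketch of the resulting layout, using Python's strftime as a stand-in for Firehose's !{timestamp:...} evaluation (the sample timestamp is illustrative):

```python
from datetime import datetime, timezone

# Mirror the Firehose prefix pattern
#   myPrefix/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
# with the equivalent strftime directives.
ts = datetime(2021, 3, 5, 14, 30, tzinfo=timezone.utc)
prefix = ts.strftime("myPrefix/year=%Y/month=%m/day=%d/hour=%H/")
print(prefix)  # myPrefix/year=2021/month=03/day=05/hour=14/
```

Because the path segments now carry year=/month=/day=/hour= names, the PARTITIONED BY clause from the question works without manually adding each partition.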
