
Partitioning table for Amazon Athena

I'm trying to partition data queried by Amazon Athena by year, month and day. However, when I query the partitioned table, I cannot get any records. I followed the instructions found in this blog post.

Create table query:

CREATE external TABLE mvc_test2 (
ROLE struct<Scope: string, Id: string>,
ACCOUNT struct<ClientId: string, Id: string, Name: string>,
USER struct<Id: string, Name: string>,
IsAuthenticated INT,
Device struct<IpAddress: string>,
Duration double,
Id string,
ResultMessage string,
Application struct<Version: string, Build: string, Name: string>,
Timestamp string,
ResultCode INT
)
Partitioned by(year string, month string, day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://firehose-titlesdesk-logs/Mvc/'

The table is created successfully, and the result message says:

"Query successful. If your table has partitions, you need to load these partitions to be able to query data. You can either load all partitions or load them individually. If you use the load all partitions (MSCK REPAIR TABLE) command, partitions must be in a format understood by Hive. Learn more."

Running

msck repair table mvc_test2;

I get the result:

"Partitions not in metastore: mvc_test2:2017/06/06/21 mvc_test2:2017/06/06/22"

At this point, I get no results when I try to query the table.

The logs are stored in subfolders by year/month/day/hour, e.g. 's3://firehose-application-logs/process/year/month/day/hour'.

How do I correctly partition the data?

It appears that your directory format is 2017/06/06/22. This is not compatible with Hive partitions, which follow the naming convention year=2017/month=06/day=06/hour=22.

Therefore, the current layout of your data prevents MSCK REPAIR TABLE from loading the partitions automatically. You would need to rename the directories or (preferably) process your data through Hive to store it in the correct format.

See also: Analyzing Data in S3 using Amazon Athena

Add each partition by date. This is faster and costs less, because you load only the partitions you need rather than all of them.

ALTER TABLE mvc_test2
ADD PARTITION (year='2017',month='06',day='06')
LOCATION 's3://firehose-titlesdesk-logs/Mvc/2017/06/06/'
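
Several partitions can also be added in a single statement, each with its own location. The following is a minimal sketch, assuming each day's files sit under a prefix of the form Mvc/year/month/day/ as described in the question; the 2017/06/07 prefix is hypothetical and only there for illustration.

```sql
-- Sketch: register two daily partitions in one statement.
-- Each LOCATION is assumed to be the S3 prefix that holds that day's files;
-- the 2017/06/07 prefix is a hypothetical example.
ALTER TABLE mvc_test2 ADD IF NOT EXISTS
  PARTITION (year = '2017', month = '06', day = '06')
    LOCATION 's3://firehose-titlesdesk-logs/Mvc/2017/06/06/'
  PARTITION (year = '2017', month = '06', day = '07')
    LOCATION 's3://firehose-titlesdesk-logs/Mvc/2017/06/07/';
```

IF NOT EXISTS makes the statement safe to re-run when some of the partitions are already registered.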

You can load more partitions by changing the year, month and/or day (and the matching location) as needed; just make sure they are valid. Then you can check that your partitions are loaded by running this query:

show partitions mvc_test2
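
Once a partition is listed, filtering on the partition columns limits the scan to that partition's S3 location. A minimal sketch, assuming the partition for 2017-06-06 has been added to mvc_test2; the selected columns come from the table definition in the question.

```sql
-- Sketch: read only the 2017-06-06 partition.
-- The partition columns are declared as strings, so compare against quoted values.
SELECT id, resultcode, resultmessage
FROM mvc_test2
WHERE year = '2017'
  AND month = '06'
  AND day = '06'
LIMIT 10;
```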

AWS now supports Athena partition projection, which automates partition management and automatically adds new partitions as new data is added:

https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html#create-cloudtrail-table-partition-projection
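
For the year/month/day/hour folder layout in the question, a projected table might look roughly like the sketch below. It is an illustration under stated assumptions, not a tested definition: the table name mvc_test2_projected and the single partition column datehour are hypothetical, the column list is abbreviated, and the range start is taken from the folder names shown in the question.

```sql
-- Sketch: partition projection over a yyyy/MM/dd/HH folder layout.
-- mvc_test2_projected and datehour are hypothetical names; columns are abbreviated.
CREATE EXTERNAL TABLE mvc_test2_projected (
  Id string,
  Duration double,
  ResultMessage string,
  ResultCode INT
)
PARTITIONED BY (datehour string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://firehose-titlesdesk-logs/Mvc/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  -- Treat datehour as a date whose text form matches the folder path.
  'projection.datehour.type' = 'date',
  'projection.datehour.format' = 'yyyy/MM/dd/HH',
  -- Assumed start of the data; NOW keeps the range open-ended.
  'projection.datehour.range' = '2017/06/06/00,NOW',
  'projection.datehour.interval' = '1',
  'projection.datehour.interval.unit' = 'HOURS',
  -- Maps each projected value to its S3 prefix.
  'storage.location.template' = 's3://firehose-titlesdesk-logs/Mvc/${datehour}/'
);
```

With a table like this, neither MSCK REPAIR TABLE nor ALTER TABLE ADD PARTITION is needed; a query would filter with a condition such as datehour >= '2017/06/06/21' to limit the prefixes that are read.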
