Partitioning a table for Amazon Athena

I'm trying to partition data queried by Amazon Athena by year, month, and day. However, when I query the partitioned table, I get no records. I followed the instructions found in this blog post.

Create table query:

CREATE external TABLE mvc_test2 (
ROLE struct<Scope: string, Id: string>,
ACCOUNT struct<ClientId: string, Id: string, Name: string>,
USER struct<Id: string, Name: string>,
IsAuthenticated INT,
Device struct<IpAddress: string>,
Duration double,
Id string,
ResultMessage string,
Application struct<Version: string, Build: string, Name: string>,
Timestamp string,
ResultCode INT
)
Partitioned by(year string, month string, day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://firehose-titlesdesk-logs/Mvc/'

The table is created successfully, and the result message says:

"Query successful. If your table has partitions, you need to load these partitions to be able to query data. You can either load all partitions or load them individually. If you use the load all partitions (MSCK REPAIR TABLE) command, partitions must be in a format understood by Hive. Learn more."

Running

msck repair table mvc_test2;

I get the result:

"Partitions not in metastore: mvc_test2:2017/06/06/21 mvc_test2:2017/06/06/22"

At this point, I get no results when I try to query the table.

The logs are stored in subfolders by year/month/day/hour, e.g. 's3://firehose-application-logs/process/year/month/day/hour'.

How do I correctly partition the data?

It appears that your directory format is 2017/06/06/22. This is not compatible with Hive partitions, which use the naming convention year=2017/month=06/day=06/hour=22.

Therefore, the current layout of your data prevents MSCK REPAIR TABLE from loading the partitions. You would need to rename the directories or (preferably) process your data through Hive to store it in the correct format.
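To make the difference concrete, here is a minimal sketch of the two layouts. It assumes the objects were copied or rewritten under Hive-style prefixes beneath the same table location; the key=value paths in the comments are illustrative and do not exist in the original bucket.

```sql
-- Layout that MSCK REPAIR TABLE can discover (Hive-style key=value prefixes,
-- assumed here for illustration):
--   s3://firehose-titlesdesk-logs/Mvc/year=2017/month=06/day=06/...
-- Layout described in the question (not discoverable automatically):
--   s3://firehose-titlesdesk-logs/Mvc/2017/06/06/21/...

-- With the Hive-style layout in place, load every partition found under
-- the table location. Athena runs one statement per query, so run each
-- statement separately.
MSCK REPAIR TABLE mvc_test2;

-- Confirm which partitions were registered.
SHOW PARTITIONS mvc_test2;
```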

See also: Analyzing Data in S3 using Amazon Athena

Add each partition by date. This is faster and cheaper, because you load only the partitions you need rather than all of them.

ALTER TABLE mvc_test2
ADD PARTITION (year='2017',month='06',day='06')
location 's3://firehose-titlesdesk-logs/Mvc/2017/06/06/'

You can load more partitions by changing the year, month, and/or day (and the location to match) as needed; just make sure the values are valid. You can then check that your partitions are loaded by running this query:

show partitions mvc_test2
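Several partitions can also be registered in a single statement, each with its own explicit location. The sketch below assumes the year/month/day prefixes from the question sit directly under the table location; the 2017/06/07 prefix is an assumed example and may not exist in the bucket.

```sql
-- Register two days at once. IF NOT EXISTS makes the statement safe to re-run.
-- Each LOCATION points at that day's prefix, so the hourly subfolders
-- beneath it are read as part of the day partition.
ALTER TABLE mvc_test2 ADD IF NOT EXISTS
  PARTITION (year='2017', month='06', day='06')
    LOCATION 's3://firehose-titlesdesk-logs/Mvc/2017/06/06/'
  PARTITION (year='2017', month='06', day='07')  -- assumed to exist
    LOCATION 's3://firehose-titlesdesk-logs/Mvc/2017/06/07/';

-- Run as a separate query: filtering on the partition columns limits the
-- scan to the matching prefix instead of the whole table location.
SELECT count(*) AS records
FROM mvc_test2
WHERE year = '2017' AND month = '06' AND day = '06';
```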

AWS now supports Athena partition projection, which automates partition management: partition values and locations are computed from table properties at query time, so new data becomes queryable without adding partitions to the metastore.

https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html#create-cloudtrail-table-partition-projection
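As a rough sketch of how partition projection might be applied to the layout in the question: the table name mvc_test2_projected, the single partition column dt, the reduced column list, and the range start date are all assumptions made for illustration, not part of the original table.

```sql
-- Hypothetical table: one string partition column (dt) projected as a date,
-- so the value '2017/06/06' maps straight onto the existing folder layout.
CREATE EXTERNAL TABLE mvc_test2_projected (
  Id string,
  Duration double,
  ResultCode int,
  ResultMessage string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://firehose-titlesdesk-logs/Mvc/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.format' = 'yyyy/MM/dd',
  -- Assumed start of the data; NOW keeps the range open-ended.
  'projection.dt.range' = '2017/06/01,NOW',
  'projection.dt.interval' = '1',
  'projection.dt.interval.unit' = 'DAYS',
  -- Maps each projected dt value onto the year/month/day prefix.
  'storage.location.template' = 's3://firehose-titlesdesk-logs/Mvc/${dt}'
);

-- Run as a separate query: no MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION
-- is needed before querying a given day.
SELECT count(*) AS records
FROM mvc_test2_projected
WHERE dt = '2017/06/06';
```

With this approach the hourly subfolders are simply read as part of each day's partition; an hour-level partition would need its own projected column and a longer location template.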
