[英]Add partition projection to AWS Athena table using Cloudformation
我有一个使用 cloudformation 中指定的模板定义的 Athena 表:
Cloudformation 创建
EventsTable:
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref DatabaseName
TableInput:
Description: "My Table"
Name: !Ref TableName
TableType: EXTERNAL_TABLE
StorageDescriptor:
Compressed: True
Columns:
- Name: account_id
Type: string
Comment: "Account Id of the account making the request"
...
InputFormat: org.apache.hadoop.mapred.TextInputFormat
SerdeInfo:
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
Location: !Sub "s3://${EventsBucketName}/events/"
这很好用并且可以部署。 我还发现我可以根据这个文档和这个文档创建分区投影
并且可以通过直接创建表来完成这项工作,大致如下:
SQL 创建
CREATE EXTERNAL TABLE `performance_data.events`
(
`account_id` string,
...
)
PARTITIONED BY (
`day` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://my-bucket/events/'
TBLPROPERTIES (
'has_encrypted_data' = 'false',
'projection.enabled' = 'true',
'projection.day.type' = 'date',
'projection.day.format' = 'yyyy/MM/dd',
'projection.day.range' = '2020/01/01,NOW',
'projection.day.interval' = '1',
'projection.day.interval.unit' = 'DAYS',
'storage.location.template' = 's3://my-bucket/events/${day}/'
)
但我找不到要转换成云形成结构的文档。 所以我的问题是,如何在cloudformation中实现SQL代码中显示的分区投影?
我现在有一个可行的解决方案。 缺少的部分实际上是缺少的参数,这是解决方案:
MyTableResource:
Type: AWS::Glue::Table
Properties:
CatalogId: MyAccountId
DatabaseName: MyDatabase
TableInput:
Description: "My Table"
Name: mytable
TableType: EXTERNAL_TABLE
PartitionKeys:
- Name: day
Type: string
Comment: Day partition
Parameters:
"projection.enabled": "true"
"projection.day.type": "date"
"projection.day.format": "yyyy/MM/dd"
"projection.day.range": "2020/01/01,NOW"
"projection.day.interval": "1"
"projection.day.interval.unit": "DAYS"
"storage.location.template": "s3://my-bucket/events/${day}/"
StorageDescriptor:
Compressed: True
Columns:
...
InputFormat: org.apache.hadoop.mapred.TextInputFormat
SerdeInfo:
Parameters:
serialization.format: '1'
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
Location: "s3://my-bucket/events/"
关键的补充是:
serialization.format: '1'
这现在完全有效,可以使用分区进行查询:
select * from mytable where day > '2022/05/03'
参考 Glue Table TableInput
的 CloudFormation 参考,您可以指定PartitionKeys
和Parameters
。 这相当于查询中的PARTITIONED BY
和TBLPROPERTIES
。
例如,您可以参考这篇文章。 下面的示例显示了如何定义PartitionKeys
以及如何为Parameters
定义 JSON。 在您的情况下,您只需添加投影键(例如projection.enabled
)和值( true
)。
# Create an Amazon Glue table
CFNTableFlights:
# Creating the table waits for the database to be created
DependsOn: CFNDatabaseFlights
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref CFNDatabaseName
TableInput:
Name: !Ref CFNTableName1
Description: Define the first few columns of the flights table
TableType: EXTERNAL_TABLE
Parameters: {
"classification": "csv"
}
# ViewExpandedText: String
PartitionKeys:
# Data is partitioned by month
- Name: mon
Type: bigint
StorageDescriptor:
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Columns:
- Name: year
Type: bigint
- Name: quarter
Type: bigint
- Name: month
Type: bigint
- Name: day_of_month
Type: bigint
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Location: s3://crawler-public-us-east-1/flight/2016/csv/
SerdeInfo:
Parameters:
field.delim: ","
SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.