
AWS Athena: use "folder" name as partition

I have thousands of individual JSON files (each corresponding to one table row) stored in S3 under the following path: s3://my-bucket/<date>/dataXX.json

When I create my table with DDL, is it possible to have the data partitioned by the <date> present in the S3 path (or at least to add the value as a new column)?

Thanks

Sadly, this is not supported in Athena. For partitioning to work with folders, the folders must be named in a specific way.

e.g. s3://my-bucket/{columnname}={columnvalue}/data.json

In your case, you can still use partitioning if you add those partitions to the table manually.

e.g. ALTER TABLE tablename ADD PARTITION (datecolumn='2017-01-01') LOCATION 's3://my-bucket/2017-01-01/';
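With many date folders, several partitions can be registered in one statement. The following is a minimal sketch, assuming the table was created with PARTITIONED BY (datecolumn string) and that the folders are named with yyyy-MM-dd dates; tablename, datecolumn and the second date are illustrative, not taken from a real setup. Run each statement on its own.

```sql
-- Register two date folders at once; IF NOT EXISTS makes the statement safe to re-run.
-- Assumption: tablename is partitioned by a string column named datecolumn.
ALTER TABLE tablename ADD IF NOT EXISTS
  PARTITION (datecolumn = '2017-01-01') LOCATION 's3://my-bucket/2017-01-01/'
  PARTITION (datecolumn = '2017-01-02') LOCATION 's3://my-bucket/2017-01-02/';

-- List the partitions the table now knows about.
SHOW PARTITIONS tablename;
```

Every new date folder needs its own ADD PARTITION before its data shows up in query results.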

The AWS docs have some good examples on that topic.

AWS Athena Partitioning

It is now possible to do this using partition projection with storage.location.template, which partitions the table by some part of your path. Be sure NOT to include the new column in the column list, as it is added automatically. There are many options you can look up to adapt this to your date example. I used "id" to show the simplest version I could think of.

    CREATE EXTERNAL TABLE `some_table` (
      `col1` bigint
    )
    PARTITIONED BY (
      `id` string
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION 's3://path/bucket/'
    TBLPROPERTIES (
      'has_encrypted_data'='false',
      'projection.enabled'='true',
      'projection.id.type'='injected',
      'storage.location.template'='s3://path/bucket/${id}/'
    )

Official docs: https://docs.amazonaws.cn/en_us/athena/latest/ug/partition-projection-dynamic-id-partitioning.html
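For the date layout in the question, the date projection type may fit better than injected, because an injected column has to be pinned to a specific value in the WHERE clause of every query. Below is a hedged sketch, assuming the folders are named with yyyy-MM-dd dates starting at 2017-01-01; my_table, dt, col1 and the range are assumptions for illustration. Run each statement on its own.

```sql
-- Partition projection over date-named folders such as s3://my-bucket/2017-01-01/.
-- Assumptions: table name, column names, date format and range start are illustrative.
CREATE EXTERNAL TABLE my_table (
  col1 bigint
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.range' = '2017-01-01,NOW',
  'projection.dt.interval' = '1',
  'projection.dt.interval.unit' = 'DAYS',
  'storage.location.template' = 's3://my-bucket/${dt}/'
);

-- Filtering on dt limits the scan to the matching folder.
SELECT col1 FROM my_table WHERE dt = '2017-01-01';
```

With projection enabled, Athena computes the partition locations from the template at query time, so no ALTER TABLE ADD PARTITION or MSCK REPAIR TABLE step is needed when new date folders appear.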

It's not necessary to do this manually. Set up a Glue crawler and it will pick up the folder (in the prefix) as a partition, provided all the folders in the path have the same structure and all the data has the same schema.

But it will name the partition partition_0. You can go into "Edit schema" and change the name of this partition to date or whatever you like.

Then make sure you go into your Glue crawler and, under "Configuration options", select "Add new columns only". Otherwise the next crawler run will reset the partition name back to partition_0.

You need to name each S3 folder as in this picture:

[image]

With Athena set up, specify dt for the partition:

[image]

After that, run MSCK REPAIR TABLE <your table name>; in Athena.
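Since the pictures are not available here, the steps can be sketched as follows, assuming the folders are renamed to the Hive-style form dt=<value> (for example s3://my-bucket/dt=2017-01-01/data01.json). The table name my_table and the column col1 are assumptions for illustration. Run each statement on its own.

```sql
-- Table over Hive-style folders; dt is declared only in PARTITIONED BY, not in the column list.
CREATE EXTERNAL TABLE my_table (
  col1 bigint
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/';

-- Scan the table location and register every dt=... folder as a partition.
MSCK REPAIR TABLE my_table;

-- Sanity check: row counts per discovered partition.
SELECT dt, count(*) AS row_count
FROM my_table
GROUP BY dt
ORDER BY dt;
```

MSCK REPAIR TABLE only recognizes folders in the key=value form, so it has to be re-run (or the partition added manually) whenever a new dt folder is created.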
