
AWS Athena: partition by multiple columns in the same path

I am trying to create a table in Athena based on a directory in S3 that looks something like this:

folders/
  id=1/
    folder1/
    folder2/
    folder3/
      dt=***/
      dt=***/
  id=2/
...

I want to partition by two columns: one is id, and the other is dt.

So eventually I want my table to have an id column and, for each id, all of the dt values in its sub-folder folder3. Is there any solution for this that doesn't force me to have a path like this: ...\id=\dt= ?

I tried simply setting these two columns in the "partitioned by" section, with the location set to the "folders" path, but then the table has no data.

I then tried using injection and setting a specific id in a WHERE clause when querying the table, but then the table contains data I don't need, and the partitioning doesn't seem to work as I expected.

Table DDL:

CREATE EXTERNAL TABLE IF NOT EXISTS `database`.`test_table` (
  `col1` string,
  `col2` string
) PARTITIONED BY (
  id string,
  dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://folders/'

Appreciate any help!

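Athena only queries partitions that have been registered in the table's metadata, which is why the table appears empty. You can "manually" add the partitions using something like: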

alter table your_table add if not exists
partition (id='1', dt='0')
location 's3://folders/id=1/folder3/dt=0/'
partition (id='1', dt='1')
location 's3://folders/id=1/folder3/dt=1/'
...
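To check that the partitions were registered and that filtering on them works, something along these lines may help. It is only a sketch: your_table is the placeholder name used above, col1 and col2 come from the DDL in the question, and the partition values '1' and '0' are just the illustrative ones from the statement above.

```sql
-- List the partitions Athena currently knows about for this table.
SHOW PARTITIONS your_table;

-- Filter on both partition columns so that only
-- .../id=1/folder3/dt=0/ should need to be scanned.
SELECT col1, col2
FROM your_table
WHERE id = '1'
  AND dt = '0'
LIMIT 10;
```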

You can add all your partitions on S3 programmatically this way: use the AWS CLI to list all folders, loop over them, and register each one as a partition of the table using a query like the one above (see the docs).

An alternative is to use partition projection with custom storage locations, which gives you faster queries and removes the need to manually add new partitions when new data arrives in S3 (see the partition projection docs, especially the section on custom S3 locations).
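As a rough sketch of what that could look like for the layout in the question: the table below reuses the question's DDL and adds projection properties. The enum values '1,2', the date format 'yyyy-MM-dd' and the range start '2020-01-01' are assumptions made for illustration, since the real ids and dt values are not shown; adjust them to whatever actually exists under s3://folders/.

```sql
-- Sketch only: the projection types, values, format and range are assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS `database`.`test_table` (
  `col1` string,
  `col2` string
)
PARTITIONED BY (
  id string,
  dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://folders/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  -- id: a fixed list of known ids (assumed here to be 1 and 2)
  'projection.id.type' = 'enum',
  'projection.id.values' = '1,2',
  -- dt: assumed to be a calendar date such as 2020-01-31
  'projection.dt.type' = 'date',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.range' = '2020-01-01,NOW',
  -- maps each (id, dt) pair onto the non-Hive-style path with folder3 in the middle
  'storage.location.template' = 's3://folders/id=${id}/folder3/dt=${dt}/'
);
```

With a template like this, Athena computes the S3 location for each partition from the query's id and dt filters instead of looking it up in the catalog, so folder1 and folder2 are never read and no ALTER TABLE ADD PARTITION calls are needed.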
