
Alternative to create more than 100 partitions on Athena CTAS

I'm currently creating some new tables from information stored in Amazon S3. This is my first time using AWS, and today I learned that Amazon Athena can't create more than 100 partitions from a single CTAS query.

I'm doing the transformations in SQL and it works perfectly, but I need a way to write more than 100 partitions at once to make the process more reliable.

I'm partitioning on date, so in 4 months my process is going to fail whenever I need to recreate the table and reload a large amount of data via the SQL where I have the transformations.

Any idea how I can achieve this?

The best option would be to write a Glue ETL (Spark) job for this task and use Spark SQL to perform the required transformations. That way you still get to use your existing SQL queries.

Then you can write the processed output back to some S3 path. Spark allows you to create as many partitions as you want. It also lets you append the newly processed data to the already processed data, thereby allowing you to load and transform only the new data.

Once the ETL is done, create an external table pointing to the S3 path used above with the required partitions. Creating the external table is a one-time step; you will only need to update the partition information in this external table after every Glue job.
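A minimal sketch of what such a script might look like, written as plain PySpark (the paths, view name and columns are placeholders, and a real Glue job would receive the date as a job parameter):

import sys
from pyspark.sql import SparkSession

# Placeholder: take the processing date from the command line
process_date = sys.argv[1] if len(sys.argv) > 1 else "2019-01-01"

spark = SparkSession.builder.appName("daily-transform").getOrCreate()

# Read only the new day's raw data (placeholder path and format)
raw = spark.read.json(f"s3://source-bucket/raw/{process_date}/")

# Reuse the existing SQL transformations through Spark SQL
raw.createOrReplaceTempView("raw_events")
transformed = spark.sql("""
    SELECT *, date_format(event_time, 'yyyy-MM-dd') AS dt
    FROM raw_events
""")

# Append the new partition next to the already processed data;
# Spark has no 100-partition limit here.
transformed.write.mode("append").partitionBy("dt").parquet("s3://target-bucket/data-root/")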

In summary, you need to do the following:

  • Create a Spark script to be executed as a Glue ETL job, which reads the daily source data, applies the required transformations and writes the processed data to a new partition on S3, as sketched above. This script can easily be templatized to accept the date as input, and writing it is a one-time activity.

  • Create an external table pointing to the processed data on S3. This is also a one-time activity.

  • Execute the MSCK REPAIR TABLE command on the external table after every Glue ETL job to register the new partition (see the sketch below).
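A rough sketch of these last two steps driven from Python with boto3 (the database, table, columns and bucket names are placeholders):

import boto3

athena = boto3.client("athena")

def run_query(sql):
    # Fire-and-forget submission; in practice you would poll get_query_execution for completion
    return athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
    )

# One-time step: external table on top of the data written by the Glue job
run_query("""
CREATE EXTERNAL TABLE IF NOT EXISTS processed_events (
    id string,
    payload string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://target-bucket/data-root/'
""")

# After every Glue ETL run: register the newly written partition directories
run_query("MSCK REPAIR TABLE processed_events")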

References:

AWS Glue ETL documentation

AWS Athena - Create external table

AWS Athena - Update partition

Let's say you want to process 4 months of data with CTAS queries, but you need to partition it by day. If you do it in a single CTAS query you will end up with roughly 4 x 30 = 120 partitions, so the query will fail, as you mention, due to AWS limitations.

Instead, you can process your data one month at a time, so you are guaranteed to have no more than 31 partitions per query. However, the result of each CTAS query must have a unique external location on S3, i.e. if you want to store the results of multiple CTAS queries under s3://bukcet-name/data-root, you would need to extend this path for each query in external_location under the WITH clause. The obvious choice for your case would be the year and month, for example:

s3://bukcet-name/data-root
├──2019-01            <-- external_location='s3://bukcet-name/data-root/2019-01'
|   └── month=01
|       ├── day=01
|       |   ...
|       └── day=31
├──2019-02            <-- external_location='s3://bukcet-name/data-root/2019-02'
|   └── month=02
|       ├── day=01
|       |   ...
|       └── day=28
...
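These monthly CTAS queries could be scripted, for example with boto3 (a sketch; the database, source table, columns and result location are illustrative):

import boto3

athena = boto3.client("athena")

for month in ["2019-01", "2019-02", "2019-03", "2019-04"]:
    # Each query covers a single month, so it creates at most 31 day-partitions
    # and writes to its own external_location.
    ctas = f"""
    CREATE TABLE staging_{month.replace('-', '_')}
    WITH (
        external_location = 's3://bukcet-name/data-root/{month}',
        format = 'PARQUET',
        partitioned_by = ARRAY['month', 'day']
    ) AS
    SELECT *, month(event_date) AS month, day(event_date) AS day
    FROM source_table
    WHERE date_format(event_date, '%Y-%m') = '{month}'
    """
    athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
    )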

However, you now end up with 4 different tables, so you either need to query different tables or do some post-processing. Essentially, you have two options:

  1. Move all new files into a common place with AWS CLI high-level commands, followed by MSCK REPAIR TABLE, since the output "directory" structure adheres to the Hive partitioning naming convention. For example, from

    s3://bukcet-name/data-staging-area
    ├──2019-01            <-- external_location='s3://bukcet-name/data-staging-area/2019-01'
    |   └── month=01
    |       ├── day=01
    |       |   ...

    you would copy into

    s3://bukcet-name/data-root
    ├── month=01
    |   ├── day=01
    |   |   ...
    |   └── day=31
    ├── month=02
    |   ├── day=01
    |   |   ...
    |   └── day=28
  2. Manipulate the AWS Glue Data Catalog. This is a little trickier, but the main idea is that you define a root table with its location pointing to s3://bukcet-name/data-root. Then, after executing each CTAS query, you copy the meta-information about partitions from the created "staging" table into the root table. This step would be based on the AWS Glue API, for example via the boto3 library for Python; in particular, you would use the get_partitions() and batch_create_partition() methods, as sketched below.
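For the second option, a possible sketch of the partition-copying step with boto3 (database and table names are placeholders):

import boto3

glue = boto3.client("glue")

def copy_partitions(database, staging_table, root_table):
    # Read all partition definitions of the "staging" table created by the CTAS query
    paginator = glue.get_paginator("get_partitions")
    partition_inputs = []
    for page in paginator.paginate(DatabaseName=database, TableName=staging_table):
        for partition in page["Partitions"]:
            # Keep only the fields accepted by PartitionInput
            partition_inputs.append({
                "Values": partition["Values"],
                "StorageDescriptor": partition["StorageDescriptor"],
                "Parameters": partition.get("Parameters", {}),
            })

    # Attach them to the root table; the API accepts up to 100 partitions per call
    for i in range(0, len(partition_inputs), 100):
        glue.batch_create_partition(
            DatabaseName=database,
            TableName=root_table,
            PartitionInputList=partition_inputs[i:i + 100],
        )

copy_partitions("analytics", "staging_2019_01", "data_root")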

Regardless of which approach you choose, you will need some sort of job scheduling software, especially since your data is not just historical. I would suggest using Apache Airflow for that. It can be seen as an alternative to a combination of Lambda and Step Functions, and it is totally free. There are plenty of blog posts and documentation that can help you get started. For example:

  • Medium post: Automate executing AWS Athena queries and moving the results around S3 with Airflow.
  • Complete guide to the installation of Airflow: link 1 and link 2

You can even set up an integration with Slack to send notifications when your queries terminate in either a success or failure state.

Things to keep in mind:

In general, you don't have explicit control over how many files will be created by a CTAS query, since Athena is a distributed system. On the other hand, you don't want to end up with a lot of small files. So you can try this workaround, which uses the bucketed_by and bucket_count fields within the WITH clause:

CREATE TABLE new_table
WITH (
    ...
    bucketed_by=ARRAY['some_column_from_select'],
    bucket_count=1
) AS (
    -- Here goes your normal query
    SELECT
        *
    FROM
        old_table
);

Alternatively, reduce the number of partitions, i.e. stop at the month level.

Amazon Athena has a separate guide dedicated to this topic.

The main steps:

  1. Use CREATE EXTERNAL TABLE to prepare a table partitioned as expected
  2. Use CTAS with a low enough number of partitions
  3. Iteratively use INSERT INTO to fill up the missing partitions, as sketched below.
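A sketch of that iterative backfill driven from Python with boto3 (database, table and column names are placeholders; each statement covers a single month so it stays well under the 100-partition limit):

import boto3

athena = boto3.client("athena")

def run_query(sql):
    return athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
    )

# Backfill the remaining months one INSERT INTO at a time
for month in ["2019-02", "2019-03", "2019-04"]:
    run_query(f"""
        INSERT INTO processed_events
        SELECT *, date_format(event_date, '%Y-%m-%d') AS dt
        FROM source_table
        WHERE date_format(event_date, '%Y-%m') = '{month}'
    """)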
