AWS Athena Query Partitioning

I am trying to use AWS Athena to provide analytics for an existing platform. Currently the flow looks like this:

  1. Data is pumped into a Kinesis Firehose as JSON events.
  2. The Firehose converts the data to Parquet using a table in AWS Glue and writes it to S3 either every 15 minutes or when the stream reaches 128 MB (the maximum supported values).
  3. When the data is written to S3 it is partitioned with a path /year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/... (see the sketch after this list).
  4. An AWS Glue crawler updates a table with the latest partition data every 24 hours and makes it available for queries.
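For reference, a minimal sketch of how the custom prefix in step 3 resolves into Hive-style partition directories (the bucket and object names here are only examples):

```
# Custom S3 prefix on the Firehose delivery stream (step 3); the
# !{timestamp:...} expressions are evaluated when a batch is delivered.
PREFIX_TEMPLATE = "year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"

# A batch delivered on 2019-04-29 therefore lands under a Hive-style key such as
#   s3://some-bucket/firehose/year=2019/month=04/day=29/<firehose-object-name>.parquet
```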

The basic flow works. However, there are a couple of problems with this...

The first (and most important) is that this data is part of a multi-tenancy application. There is a property inside each event called account_id. Every query that will ever be issued will be issued by a specific account, and I don't want to be scanning all account data for every query. I need to find a scalable way to query only the relevant data. I did look into trying to use Kinesis to extract the account_id and use it as a partition. However, this currently isn't supported, and with > 10,000 accounts the AWS 20k partition limit quickly becomes a problem.

The second problem is file size! AWS recommends that files not be smaller than 128 MB, as this has a detrimental effect on query times: the execution engine might spend additional time on the overhead of opening Amazon S3 files. Given the nature of the Firehose, I can only ever reach a maximum size of 128 MB per file.

With that many accounts you probably don't want to use account_id as the partition key, for many reasons. I think you're fine limits-wise (the partition limit per table is 1M), but that doesn't mean it's a good idea.

You can decrease the amount of data scanned significantly by partitioning on part of the account ID, though. If your account IDs are uniformly distributed (like AWS account IDs) you can partition on a prefix. If your account IDs are numeric, partitioning on the first digit would decrease the amount of data each query scans by 90%, and two digits by 99%, while still keeping the number of partitions at a very reasonable level.
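As an illustration, with a hypothetical query table called events_by_account that has an account_id_prefix partition key, a per-account query could be built like this (a sketch, not something that works against your current schema):

```
def account_events_query(account_id):
    """Build an Athena query that prunes partitions by account ID prefix.

    Assumes a hypothetical table `events_by_account` partitioned by
    `account_id_prefix`; Athena only scans the partitions matching the
    prefix, and the exact account_id filter runs on that small subset.
    """
    prefix = account_id[:1]  # one leading digit/character; use [:2] for two
    return f"""
        SELECT *
        FROM events_by_account
        WHERE account_id_prefix = '{prefix}'  -- pruned at the partition level
          AND account_id = '{account_id}'     -- filtered while scanning
    """
```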

Unfortunately I don't know how to do that with Glue either. I've found Glue very unhelpful in general when it comes to doing ETL. Even simple things are hard in my experience. I've had much more success using Athena's CTAS feature combined with some simple S3 operations to add the data produced by a CTAS operation as a partition in an existing table.

If you figure out a way to extract the account ID you can also experiment with separate tables per account; you can have 100K tables in a database. It wouldn't be very different from partitions in a table, but it could be faster depending on how Athena determines which partitions to query.

Don't worry too much about the 128 MB file size rule of thumb. It's absolutely true that having lots of small files is worse than having a few large files, but it's also true that scanning through a lot of data to filter out just a tiny portion is very bad for performance and cost. Athena can deliver results in a second even for queries over hundreds of files that are just a few KB in size. I would worry about making sure Athena is reading the right data first, and about ideal file sizes later.

If you tell me more about the amount of data per account and the expected lifetime of accounts, I can give more detailed suggestions on what to aim for.


Update: Given that Firehose doesn't let you change the directory structure of the input data, that Glue is generally pretty bad, and the additional context you provided in a comment, I would do something like this:

  • Create an Athena table with columns for all properties in the data, and date as the partition key. This is your input table; only ETL queries will be run against it. Don't worry that the input data has separate directories for year, month, and day; you only need one partition key. Having these as separate partition keys just complicates things, and having one means it can be of type DATE, instead of three separate STRING columns that you have to assemble into a date every time you want to do a date calculation.

  • Create another Athena table with the same columns, but partitioned by account_id_prefix and either date or month. This will be the table you run queries against. account_id_prefix will be one or two characters from your account ID; you'll have to test what works best. You'll also have to decide whether to partition by date or by a longer time span. Dates will make ETL easier and cheaper, but longer time spans will produce fewer and larger files, which can make queries more efficient (but possibly more expensive).

  • Create a Step Functions state machine that does the following (in Lambda functions):

    • Add new partitions to the input table. If you schedule your state machine to run once per day, it can just add the partition that corresponds to the current date. Use the Glue CreatePartition API call to create the partition (unfortunately this needs a lot of information to work; you can run a GetTable call to get it, though). Use for example ["2019-04-29"] as Values and "s3://some-bucket/firehose/year=2019/month=04/day=29" as StorageDescriptor.Location. This is the equivalent of running ALTER TABLE some_table ADD PARTITION (date = '2019-04-29') LOCATION 's3://some-bucket/firehose/year=2019/month=04/day=29', but doing it through Glue is faster than running queries in Athena and more suitable for Lambda (see the first sketch after this list).
    • Start a CTAS query over the input table with a filter on the current date, partitioned by the first character(s) of the account ID and the current date. Use a location for the CTAS output that is below your query table's location. Generate a random name for the table created by the CTAS operation; this table will be dropped in a later step. Use Parquet as the format (see the second sketch after this list).
    • Look at the Poll for Job Status example state machine for inspiration on how to wait for the CTAS operation to complete.
    • When the CTAS operation has completed, list the partitions of the temporary table with Glue GetPartitions and create the same partitions in the query table with BatchCreatePartitions (also covered in the first sketch below).
    • Finally, delete all files that belong to the partitions of the query table you deleted, and drop the temporary table created by the CTAS operation.
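To make the Glue API steps more concrete, here is a rough boto3 sketch of the first and fourth steps above: registering the day's Firehose output as a partition on the input table, and copying the partitions of the temporary CTAS table into the query table. The database and table names (analytics, events_raw, events_by_account) are placeholders for whatever you use.

```
import boto3

glue = boto3.client("glue")

DATABASE = "analytics"              # placeholder database name
INPUT_TABLE = "events_raw"          # input table, partitioned by date
QUERY_TABLE = "events_by_account"   # query table, partitioned by account_id_prefix, date


def add_input_partition(date, location):
    """Register one day of Firehose output as a partition on the input table.

    Equivalent to ALTER TABLE ... ADD PARTITION, but through the Glue API,
    which is faster and easier to call from Lambda.
    """
    # Reuse the table's storage descriptor so the partition inherits the
    # right columns, SerDe, and input/output formats.
    table = glue.get_table(DatabaseName=DATABASE, Name=INPUT_TABLE)["Table"]
    glue.create_partition(
        DatabaseName=DATABASE,
        TableName=INPUT_TABLE,
        PartitionInput={
            "Values": [date],  # e.g. ["2019-04-29"]
            "StorageDescriptor": dict(table["StorageDescriptor"], Location=location),
        },
    )


def copy_ctas_partitions(ctas_table):
    """Copy the partitions of the temporary CTAS table into the query table."""
    query_table = glue.get_table(DatabaseName=DATABASE, Name=QUERY_TABLE)["Table"]
    partition_inputs = []
    for page in glue.get_paginator("get_partitions").paginate(
        DatabaseName=DATABASE, TableName=ctas_table
    ):
        for partition in page["Partitions"]:
            partition_inputs.append({
                "Values": partition["Values"],
                "StorageDescriptor": dict(
                    query_table["StorageDescriptor"],
                    Location=partition["StorageDescriptor"]["Location"],
                ),
            })
    # BatchCreatePartition accepts at most 100 partitions per call.
    for i in range(0, len(partition_inputs), 100):
        glue.batch_create_partition(
            DatabaseName=DATABASE,
            TableName=QUERY_TABLE,
            PartitionInputList=partition_inputs[i:i + 100],
        )
```

Copying the table's own StorageDescriptor and only overriding Location keeps the SerDe and format information consistent without having to spell it out by hand.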

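Here is a similar sketch for starting the CTAS query and waiting for it to finish. In a real state machine the waiting would be a Step Functions loop rather than a sleep inside a single Lambda, and the one-character prefix, bucket names, and column list are all assumptions to adapt.

```
import time
import uuid
import boto3

athena = boto3.client("athena")


def start_daily_ctas(date):
    """Start a CTAS query that repartitions one day of input data.

    The temporary table gets a random name and its output location sits
    below the query table's location, so the resulting partitions only
    need to be registered, not moved.
    """
    temp_table = "ctas_tmp_" + uuid.uuid4().hex
    external_location = f"s3://some-bucket/query-table/{temp_table}/"  # placeholder bucket
    ctas = f"""
        CREATE TABLE {temp_table}
        WITH (
            format = 'PARQUET',
            external_location = '{external_location}',
            partitioned_by = ARRAY['account_id_prefix', 'date']
        ) AS
        SELECT
            account_id,
            -- ... the rest of the event columns ...
            substr(account_id, 1, 1) AS account_id_prefix,  -- one leading character
            date
        FROM events_raw
        WHERE date = DATE '{date}'
    """
    response = athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://some-bucket/athena-results/"},
    )
    return temp_table, response["QueryExecutionId"]


def wait_for_query(query_execution_id):
    """Poll Athena until the query finishes; raise if it fails."""
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_execution_id)[
            "QueryExecution"]["Status"]
        if status["State"] == "SUCCEEDED":
            return
        if status["State"] in ("FAILED", "CANCELLED"):
            raise RuntimeError("CTAS query %s: %s" % (
                status["State"], status.get("StateChangeReason", "")))
        time.sleep(10)
```

Once wait_for_query returns you can hand the temporary table name to the partition-copying step above and then drop the temporary table.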
If you decide on partitioning on something longer than a date you can still use the process above, but you also need to delete partitions in the query table and the corresponding data on S3, because each update will replace existing data. For example, with partitioning by month (which I would recommend you try), every day you would create new files for the whole month, which means that the old files need to be removed. If you want to update your query table multiple times per day it would be the same.
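If you do go with month (or another longer span), the removal step might look roughly like this; again, the bucket, prefix, database, and table names are placeholders:

```
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")


def drop_query_partition(bucket, prefix, partition_values):
    """Remove a query-table partition that is about to be replaced.

    Deletes the partition's objects on S3 and its catalog entry; the new
    CTAS output for the same period is then registered in its place.
    """
    # Delete the data files under the partition's S3 prefix.
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
    # Remove the catalog entry (ignore it if the partition doesn't exist yet).
    try:
        glue.delete_partition(
            DatabaseName="analytics",
            TableName="events_by_account",
            PartitionValues=partition_values,  # e.g. ["1", "2019-04"]
        )
    except glue.exceptions.EntityNotFoundException:
        pass
```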

This looks like a lot, and looks like what Glue Crawlers and Glue ETL do, but in my experience they don't make it this easy.

In your case the data is partitioned using Hive-style partitioning, which Glue Crawlers understand, but in many cases you don't get Hive-style partitions, just Y/M/D (and I didn't actually know that Firehose could deliver data this way; I thought it only did Y/M/D). A Glue Crawler will also do a lot of extra work every time it runs, because it can't know where data has been added, but you know that the only partition that has been added since yesterday is the one for yesterday, so crawling is reduced to a one-step deal.

Glue ETL also makes things very hard, and it's an expensive service compared to Lambda and Step Functions. All you want to do is convert your raw data from JSON to Parquet and re-partition it. As far as I know it's not possible to do that with less code than an Athena CTAS query. Even if you could do the conversion with Glue ETL in less code, you'd still have to write a lot of code to replace partitions in your destination table, because that's something that Glue ETL and Spark simply don't support.

Athena CTAS wasn't really made to do ETL, and I think the method I've outlined above is much more complex than it should be, but I'm confident that it's less complex than trying to do the same thing (i.e. continuously update and potentially replace partitions in a table based on the data in another table, without rebuilding the whole table every time) any other way.

What you get with this ETL process is that your ingestion doesn't have to worry about partitioning by anything more than time, but you still get tables that are optimised for querying.
