Subqueries in partitioned Athena tables

Question

I am using partitions in Athena. I have a partition called snapshot, and when I run a query like this:

select * from mytable where snapshot = '2020-06-25'

Then, as expected, only the specified partition is scanned and my query is fast. However, if I use a subquery that returns a single date, it is slow:

select * from mytable where snapshot = (select '2020-06-25')

The above actually scans all partitions, not only the specified date, and performs very poorly.

My question is: can I use a subquery to specify partitions and still get the better performance? I need to use a subquery to add some custom logic which returns a date based on some criteria.

Answer 1

Edit:

Trino 356 is able to inline such queries, see https://github.com/trinodb/trino/issues/4231#issuecomment-845733371

Older answer:

Presto still does not inline a trivial subquery like (select '2020-06-25'). This is tracked by https://github.com/trinodb/trino/issues/4231. Thus, you should not expect Athena to inline it, as Athena is based on Presto 0.172.

"I need to use a subquery to add some custom logic which returns a date based on some criteria."

If your query is going to be more sophisticated than a constant expression, it will not be inlined anyway. If snapshot is a partition key, then you could leverage a recently added feature: dynamic partition pruning. Read more at https://trino.io/blog/2020/06/14/dynamic-partition-pruning.html. This of course assumes you can choose the Presto version.

If you are constrained to Athena, your only option is to evaluate the subquery outside of the main query (separately), and pass the result back to the main query as a constant (e.g. a literal).
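A minimal Python sketch of that two-step approach, under stated assumptions: `run_query` is a hypothetical stand-in for whatever client code you use to run Athena queries (it is assumed to take SQL text and return a list of rows), the function names are illustrative, and the partition value is assumed to be an ISO date string as in the question.

```python
import datetime
from typing import Callable, List, Sequence

# Hypothetical runner type: takes SQL text, returns rows as sequences of values.
QueryRunner = Callable[[str], List[Sequence[str]]]


def build_main_query(table: str, snapshot: str) -> str:
    """Return the main query with the partition value embedded as a literal."""
    # Parsing and re-serialising the value guarantees a plain YYYY-MM-DD string,
    # so nothing unexpected ends up inside the SQL text.
    # Raises ValueError if the value is not a valid ISO date.
    value = datetime.date.fromisoformat(snapshot).isoformat()
    # The table name is assumed to come from trusted code, not user input.
    return f"select * from {table} where snapshot = '{value}'"


def query_with_constant_partition(
    run_query: QueryRunner, table: str, date_sql: str
) -> List[Sequence[str]]:
    """Run the date-picking query first, then the main query with a literal."""
    # Step 1: evaluate the custom date logic as a separate query.
    rows = run_query(date_sql)
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError("date query must return exactly one row and one column")
    snapshot = str(rows[0][0])
    # Step 2: the engine now sees a constant and can prune partitions.
    return run_query(build_main_query(table, snapshot))
```

For example, `build_main_query("mytable", "2020-06-25")` produces the fast query from the question. The cost is one extra round trip for the date query.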

Answer 2

Athena 2.0, released in late 2020, seems to have improved its push_down_predicate handling to support subqueries.

Here is the related statement from https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference.html#engine-versions-reference-0002:

Predicate inference and pushdown – Predicate inference and pushdown extended for queries that use a <symbol> IN <subquery> predicate.

My test with our own Athena table indicates this is indeed the case. My test query is roughly as below:

SELECT *
FROM table_partitioned_by_scrape_date
WHERE scrape_date = (
  SELECT max(scrape_date) 
  FROM table_partitioned_by_scrape_date
)

From the bytes scanned by the query, I can tell that Athena indeed only scanned the partition with the latest scrape_date.

Moreover, I also tested support for push_down_predicate in a JOIN clause where the join_to value is the result of another query. Even though it is not mentioned in the release notes, Athena 2.0 is apparently now smart enough to support this scenario as well and only scans the latest scrape_date partition. I tested a similar query in Athena 1.0 before, and it scanned all the partitions instead. My test query is as below:
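One way to make that comparison repeatable is to read the scanned-bytes figure that Athena reports for each query execution. In the GetQueryExecution API response (boto3: `get_query_execution`) it appears under `QueryExecution.Statistics.DataScannedInBytes`. The sketch below only parses such a response, so it runs without AWS access; the helper names and the sample numbers are illustrative assumptions, not measurements.

```python
def data_scanned_bytes(response: dict) -> int:
    """Extract the scanned-bytes statistic from a GetQueryExecution response."""
    return int(response["QueryExecution"]["Statistics"]["DataScannedInBytes"])


def scanned_fraction(query_response: dict, full_scan_response: dict) -> float:
    """Bytes scanned by a query as a fraction of a full-table scan."""
    full = data_scanned_bytes(full_scan_response)
    if full == 0:
        raise ValueError("full scan reported zero bytes")
    return data_scanned_bytes(query_response) / full


# Made-up responses, shaped like the real API output, for illustration only.
subquery_run = {"QueryExecution": {"Statistics": {"DataScannedInBytes": 1000}}}
full_scan_run = {"QueryExecution": {"Statistics": {"DataScannedInBytes": 10000}}}

print(scanned_fraction(subquery_run, full_scan_run))
```

A fraction close to one partition's share of the table suggests pruning happened; a fraction near 1.0 suggests every partition was scanned.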

WITH l as (
  SELECT max(scrape_date) as latest_scrape_date
  FROM table_partitioned_by_scrape_date
)
SELECT deckard_id
FROM table_partitioned_by_scrape_date as t
JOIN l ON t.scrape_date = l.latest_scrape_date


Answer 1
3, accepted, 2020-06-25 21:35:57

Answer 2
0, 2021-12-09 23:34:29
