
SEMI JOIN on AWS Athena

Is there a way to use a SEMI JOIN on AWS Athena (managed Presto)? I want to reduce the amount of data scanned and improve query performance.

In my case I know there is EXACTLY ONE row on one side of the join, and I wondered whether there is a way to tell the engine about it...

It would be helpful if you posted an example of what you want to achieve and how it scans too much. Your question is very broad and hard to answer.

If I understand you correctly, I think you can achieve what you are referring to by doing something like:

SELECT *
FROM table1
WHERE something IN (SELECT something FROM table2 WHERE col1 = 'the thing' LIMIT 1)

Whether this reduces the amount of data scanned depends on your specific circumstances. The idea behind the query above is that Athena only scans table2 until it finds the particular row you want to join on. If you're unlucky it will still scan the whole table, either because the value is not there or because it is at the end.

You can also use … WHERE EXISTS (SELECT … , but according to this Presto issue it is translated into a join, which could mean that the whole table is read, although with a LIMIT that might not be the case.
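For comparison, here is a minimal sketch of what the EXISTS form could look like when written out. It reuses the placeholder names from the query above (table1, table2, something, col1); the aliases t1 and t2 are introduced only for illustration. I have not measured it on Athena, so treat any effect on data scanned as something to verify against your own tables.

```sql
-- Sketch only: table and column names are the placeholders used above,
-- and t1/t2 are illustrative aliases.
SELECT t1.*
FROM table1 AS t1
WHERE EXISTS (
  SELECT 1                           -- the selected value is irrelevant for EXISTS
  FROM table2 AS t2
  WHERE t2.col1 = 'the thing'        -- the filter that picks the single row
    AND t2.something = t1.something  -- correlation that makes this a semi join
)
```

Like the IN version, this returns each row of table1 at most once, which is the semi-join behaviour you are after. Comparing the "Data scanned" figure Athena reports for both variants is probably the quickest way to see whether either one helps in your case.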
