SEMI JOIN on AWS Athena

Question

Is there a way to use a SEMI JOIN on AWS Athena (managed Presto)? I am trying to reduce the amount of data scanned and improve query performance.

In my case I know there is EXACTLY ONE row on one side of the join, and I wondered whether there is a way to tell the engine about it.

Answer 1

It would be helpful if you posted an example of what you wanted to achieve and how it scans too much. Your question is very broad and hard to answer.

If I understand you correctly, I think you can achieve what you are referring to by doing something like:

SELECT *
FROM table1
WHERE something IN (SELECT something FROM table2 WHERE col1 = 'the thing' LIMIT 1)

Whether this reduces the amount of data scanned depends on your specific circumstances. The idea behind the query above is that the LIMIT 1 lets Athena stop scanning table2 as soon as it finds the particular row you want to join on. If you're unlucky it will still scan the whole table, either because the value isn't there or because it sits at the end.

You can also use … WHERE EXISTS (SELECT …, but according to this Presto issue it is translated into a join, which could mean that the whole table is read. With a LIMIT that might not be the case.
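
For comparison, here is a rough sketch of what the EXISTS variant could look like for the IN query above. It reuses the placeholder names from that query (table1, table2, something, col1); the aliases t1 and t2 are only there for illustration, and I have not measured how much data Athena scans for it.

```sql
-- Sketch only: correlated EXISTS form of the IN query above.
-- table1, table2, something and col1 are the placeholder names used earlier;
-- t1 and t2 are illustrative aliases.
SELECT t1.*
FROM table1 AS t1
WHERE EXISTS (
    SELECT 1
    FROM table2 AS t2
    WHERE t2.col1 = 'the thing'            -- the single row you know exists
      AND t2.something = t1.something      -- the semi-join condition
)
```

If table2 really contains exactly one row with col1 = 'the thing', this should return the same rows as the IN version. If several rows match, the IN version with LIMIT 1 picks one of them arbitrarily, while EXISTS matches against all of them.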
