简体   繁体   中英

Why does LEFT OUTER JOIN produce the same results as INNER JOIN in DBEAVER when querying from Redshift?

I have 2 tables. t1.uuid has 630,000 distinct values. t2.uuid has 300,000 distinct values.

When I run

SELECT 
    t1.uuid
    , t2.uuid
FROM 
    t1 --630,000 uuids
    LEFT OUTER JOIN t2 -- 300,000 uuids
        ON t1.uuid = t2.uuid 
WHERE
    t2.uuid IS NULL

There are no results.

Edit1: For clarification: t1 looks like:

uuid
ufo123
abc456
def789

t2 looks like

uuid
ufo123
def789

Every record in t2 has a match in t1. Not every record in t1 has a match in t2 (as indicated by the size of the tables.

I expect the join to result in:

t1.uuid t2.uuid
ufo123 ufo123
abc456 NULL
def789 def789

And the result of my query to by:

t1.uuid t2.uuid
abc456 NULL

But instead I get no results. I'm not sure if there is a communication issue between Redshift and DBEAVER to produce this behavior.

Edit2: I ran the following which resulted in only 300,000 records (same as only t2):

SELECT 
    COUNT(DISTINCT t1.uuid)
FROM 
    t1 --630,000 uuids
    FULL JOIN t2 -- 300,000 uuids
        ON t1.uuid = t2.uuid 

This is completely wrong.

The problem is that you cannot check NULL with equal sign in Redshift. "NULL = NULL" is false.

You need to expand your JOIN ON clause (I left it general case which isn't needed with you WHERE clause):

SELECT 
    t1.uuid
    , t2.uuid
FROM 
    t1 --630,000 uuids
    LEFT OUTER JOIN t2 -- 300,000 uuids
        ON (t1.uuid = t2.uuid) OR (t1.uuid IS NULL AND t2.uuid IS NULL)
WHERE
    t2.uuid IS NULL

The problem is that when both t1 and t2 have NULL uuid these will all join to each other and could explode your results. So you may want to check this in a different way if there are a lot of NULLs.

As for an explanation let me address the "why" of this. This has to do with the Redshift inferring a WHERE clause from your SQL. It goes like this - if t2.uuid = X and t1.uuid = t2.uuid then Redshift should be able to apply the WHERE clause t1.uuid = X. But this produces the null set give the reasons above. Redshift does this to reduce the scanned data as much as possible.

Update:

Note that you are recreating the EXCEPT clause with this LEFT OUTER JOIN. You will likely get better results with EXCEPT.

Because you placed the predicate t2.uuid IS NULL in the WHERE clause, rather than in the ON clause. In the place where it is, this predicate silently converts the outer join into an inner join.

To make the join work as an outer join, move the predicate to the ON clause, as in:

SELECT 
    t1.uuid
    , t2.uuid
FROM 
    t1 --630,000 uuids
    LEFT OUTER JOIN t2 -- 300,000 uuids
        ON t1.uuid = t2.uuid 
       AND t2.uuid IS NULL

Your logic looks fine. Testing on a sample data. 3 records in table a and 1 record in table b . Output is 2 records that don't match.

WITH t1 AS (SELECT 1 AS col UNION SELECT 2 UNION SELECT 3)
   , t2 AS (SELECT 1 AS col)
SELECT *
FROM t1
     LEFT JOIN t2 ON t1.col = t2.col
where t2.col is null

Output

+------+-------+
|t1.col|t2.col |
+------+-------+
|2     |NULL   |
|3     |NULL   |
+------+-------+

I'd recommend checking the data that it's matching on. As Pelayo Martinez mentioned, it's highly likely your LEFT table is T2 instead of T1. Only then it'll be a no row result.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM