How can a sub-query refer to a table outside it?

I am trying to understand how a sub-query within a JOIN can refer to a field in the upper query.

The vehicles table stores the current information on the vehicles used in a company; all the vehicle history is stored in a table named vehicles_aud whose structure is completely identical to the vehicles table but also includes a reference to another table, called revisions, which stores the info about who, when, why, etc. made a change to the main table.

To get the very last action done to the vehicle, a very simple JOIN like this is used:

SELECT *
FROM vehicles v
    JOIN vehicles_aud vu ON vu.id=v.id AND vu.revision_id=(
        SELECT max(revision_id)
        from vehicles_aud
        WHERE id=v.id
    )
    JOIN revisions r ON r.id=vu.revision_id

Please don't mind the asterisk in the SELECT section: I'm sure specifying any real fields here doesn't make much sense for my question below. To be precise, this query can also be re-written the following way for better understanding:

SELECT *
FROM vehicles v
    CROSS APPLY (
        SELECT TOP 1 *
        FROM vehicles_aud
        WHERE id=v.id
        ORDER BY revision_id DESC
    ) vu
    JOIN revisions r ON r.id=vu.revision_id

In the second example, a plain JOIN cannot be used in place of CROSS APPLY.

I assumed the sub-query in the first example would have to be used with the CROSS APPLY operator, because it refers to the id field of the vehicles table outside the sub-query, but in practice the query with the JOIN above works fine. How is that possible without CROSS APPLY? In other words, in which cases and under what circumstances can a sub-query refer to fields of a table outside the sub-query?

Using analytic functions is one way to go here:

SELECT TOP 1 WITH TIES *
FROM vehicles v
INNER JOIN vehicles_aud vu ON vu.id = v.id
INNER JOIN revisions r ON r.id = vu.revision_id
ORDER BY ROW_NUMBER() OVER (PARTITION BY v.id ORDER BY vu.revision_id DESC);

The above query will return all records having the maximum revision_id value per group of records sharing the same vehicles.id value.
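To see why this works, it can help to look at the row numbers first. The following is only a minimal sketch: the table variable and its sample rows are made up for illustration and merely stand in for vehicles_aud.

```sql
-- Hypothetical sample data; @vehicles_aud stands in for the real audit table.
DECLARE @vehicles_aud TABLE (id INT, revision_id INT);
INSERT INTO @vehicles_aud VALUES (1, 10), (1, 12), (2, 11), (2, 15), (2, 13);

-- Step 1: show the number ROW_NUMBER() assigns inside each vehicle.
-- The newest revision of every vehicle gets rn = 1.
SELECT id, revision_id,
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY revision_id DESC) AS rn
FROM @vehicles_aud
ORDER BY id, rn;

-- Step 2: sort by that number and keep every row that ties with the first one.
-- All rows with rn = 1 tie, so exactly one row per vehicle survives:
-- (1, 12) and (2, 15).
SELECT TOP 1 WITH TIES id, revision_id
FROM @vehicles_aud
ORDER BY ROW_NUMBER() OVER (PARTITION BY id ORDER BY revision_id DESC);
```

The trick relies on WITH TIES treating all rows that share the lowest ORDER BY value as equal, which here means every row numbered 1.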

I'm not sure whether this will really answer your question...

In short: any kind of JOIN builds two result sets and matches them on the given condition, while any kind of APPLY invokes its operation row by row. If the APPLY returns more than one row, a result set is added (similar to a JOIN), while with single-row results the engine simply adds the columns.
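As a rough illustration of the two APPLY cases, here is a small sketch. The table variables and their rows are assumptions made up for this example; they are not the tables from the question.

```sql
-- Hypothetical sample data; ID 3 deliberately has no child rows.
DECLARE @t1 TABLE (ID INT PRIMARY KEY);
DECLARE @t2 TABLE (fkID INT, TheValue INT);
INSERT INTO @t1 VALUES (1), (2), (3);
INSERT INTO @t2 VALUES (1, 1), (1, 2), (1, 100), (2, 1), (2, 200);

-- Single-row case: an aggregate without GROUP BY always yields one row,
-- so APPLY just adds a column. ID 3 is kept, with MaxVal = NULL.
SELECT t1.ID, A.MaxVal
FROM @t1 t1
CROSS APPLY (SELECT MAX(t2.TheValue)
             FROM @t2 t2
             WHERE t2.fkID = t1.ID) A(MaxVal);

-- Multi-row case: up to two rows per outer row, so the result behaves
-- like a join. ID 3 has no rows in @t2 and drops out under CROSS APPLY.
SELECT t1.ID, A.TheValue
FROM @t1 t1
CROSS APPLY (SELECT TOP 2 t2.TheValue
             FROM @t2 t2
             WHERE t2.fkID = t1.ID
             ORDER BY t2.TheValue DESC) A;
```

With OUTER APPLY instead of CROSS APPLY in the second query, ID 3 would be kept with a NULL value, much like a LEFT JOIN.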

The reality will be much more complicated.

The engine is very smart and will decide on the best plan after checking statistics, indexes, existing plans and so on. It is very likely that the real plan you get is not what you'd expect, and fairly likely that seemingly different queries end up with the same plan.

Try the following with "include actual plans" switched on:

USE master;
GO
CREATE DATABASE testPlan;
GO
USE testPlan;
GO

CREATE TABLE t1 (ID INT IDENTITY CONSTRAINT pk PRIMARY KEY, SomeValue VARCHAR(100));
INSERT INTO t1 VALUES('MaxVal will be 100'),('MaxVal will be 200'),('MaxVal will be 300');
GO

CREATE TABLE t2(fkID INT CONSTRAINT fk FOREIGN KEY REFERENCES t1(ID),TheValue INT);
INSERT INTO t2 VALUES(1,1),(1,2),(1,100)
                    ,(2,1),(2,2),(2,200)
                    ,(3,1),(3,2),(3,300);
GO

--a scalar computation using MAX()
SELECT *
      ,(SELECT MAX(t2.TheValue) FROM t2 WHERE t1.ID=t2.fkID) AS MaxVal
FROM t1

--the same as above, but with APPLY
SELECT *
FROM t1
CROSS APPLY(SELECT MAX(t2.TheValue) FROM t2 WHERE t1.ID=t2.fkID) A(MaxVal)

--Now we pick the TOP 1 after an ORDER BY
SELECT *
      ,(SELECT TOP 1 t2.TheValue FROM t2 WHERE t1.ID=t2.fkID ORDER BY t2.TheValue DESC) AS MaxVal
FROM t1

--and again the same with APPLY
SELECT *
FROM t1
CROSS APPLY(SELECT TOP 1 t2.TheValue FROM t2 WHERE t1.ID=t2.fkID ORDER BY t2.TheValue DESC) A(MaxVal)

--Tim's approach using the very slick TOP 1 WITH TIES approach
SELECT TOP 1 WITH TIES *
FROM t1 INNER JOIN t2 ON t1.ID=t2.fkID
ORDER BY ROW_NUMBER() OVER(PARTITION BY t1.ID ORDER BY t2.TheValue DESC);

GO
USE master;
GO
--careful with real data!
--DROP DATABASE testPlan;
GO

The plan for the "scalar MAX" uses a table scan on 27(!) rows, reduced to 9. The same approach with APPLY has the same plan. The engine is smart enough to see that this does not need a full-blown result set. As a side note: with the APPLY version you can use MaxVal like a variable elsewhere in the query, which is very helpful.

The plan with TOP 1 in a sub-query is the most expensive in this tiny test. It starts with the same as above (table scan with 27 rows, reduced to 9), but has to add a sort operation. The variation with APPLY is roughly the same.

The approach with TOP 1 WITH TIES takes the 9 rows of t2 and sorts them. The next operation is done against those 9 rows, followed by one more sort and the reduction to the TOP rows.

In this case the first is the fastest - by far.

But in (your) reality the actual behavior will depend on existing indexes, statistics and the actual row counts. Furthermore you have one additional level (one more table) in between. The more complex a query is, the harder it will be for the optimizer to find the best plan.

Conclusion

If performance matters, then race your horses and do the measurements. If performance is not so important, take the query which is easiest to read, understand and maintain.
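One way to race the horses is to switch on SET STATISTICS IO and SET STATISTICS TIME and compare what the Messages tab reports for each candidate. The sketch below is only an assumption about how such a comparison could be set up; the temp tables #t1 and #t2 are hypothetical stand-ins mirroring t1 and t2 from the test script.

```sql
-- Hypothetical temp tables mirroring t1 / t2 from the test script.
CREATE TABLE #t1 (ID INT PRIMARY KEY, SomeValue VARCHAR(100));
CREATE TABLE #t2 (fkID INT, TheValue INT);
INSERT INTO #t1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO #t2 VALUES (1, 1), (1, 100), (2, 1), (2, 200), (3, 1), (3, 300);

-- Report logical reads and CPU / elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Candidate 1: correlated scalar subquery with MAX()
SELECT t1.ID,
       (SELECT MAX(t2.TheValue) FROM #t2 t2 WHERE t2.fkID = t1.ID) AS MaxVal
FROM #t1 t1;

-- Candidate 2: TOP 1 WITH TIES over ROW_NUMBER()
SELECT TOP 1 WITH TIES t1.ID, t2.TheValue
FROM #t1 t1
INNER JOIN #t2 t2 ON t2.fkID = t1.ID
ORDER BY ROW_NUMBER() OVER (PARTITION BY t1.ID ORDER BY t2.TheValue DESC);

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

DROP TABLE #t1;
DROP TABLE #t2;
```

On a handful of rows the numbers mean little; the comparison only becomes meaningful against realistic row counts and the indexes you actually have.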

This is your first query:

SELECT *
FROM vehicles v JOIN
     vehicles_aud va
     ON va.id = v.id AND
        va.revision_id = (SELECT MAX(va2.revision_id)
                          FROM vehicles_aud va2
                          WHERE va2.id = v.id
--------------------------------^
                         ) JOIN
     revisions r
     ON r.id = va.revision_id;

I assume your question is about this clause. This is a correlation clause in a correlated subquery. The use of table aliases clarifies what is happening.

Logically, what is happening is that for each row in the outer query, the inner query is run with a separate value for v.id. As you seem to know, it pulls the most recent value of revision_id.
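As for when an outer reference is allowed at all, a minimal sketch may help. The table variables and rows below are made up for illustration and only stand in for vehicles and vehicles_aud. A scalar subquery inside an ON (or WHERE, or SELECT) clause may refer to aliases of tables already introduced in the FROM clause, whereas a derived table joined with a plain JOIN may not.

```sql
-- Hypothetical sample data standing in for vehicles / vehicles_aud.
DECLARE @vehicles TABLE (id INT PRIMARY KEY);
DECLARE @vehicles_aud TABLE (id INT, revision_id INT);
INSERT INTO @vehicles VALUES (1), (2);
INSERT INTO @vehicles_aud VALUES (1, 10), (1, 12), (2, 11);

-- Allowed: the subquery is part of the ON expression, so the alias v
-- is in scope. Returns (1, 12) and (2, 11).
SELECT v.id, va.revision_id
FROM @vehicles v
    JOIN @vehicles_aud va
        ON va.id = v.id
       AND va.revision_id = (SELECT MAX(va2.revision_id)
                             FROM @vehicles_aud va2
                             WHERE va2.id = v.id);

-- Not allowed: a derived table is its own row source and cannot see v.
-- SQL Server rejects this, typically with an error along the lines of
-- 'The multi-part identifier "v.id" could not be bound.'
-- SELECT v.id, va.revision_id
-- FROM @vehicles v
--     JOIN (SELECT TOP 1 a.*
--           FROM @vehicles_aud a
--           WHERE a.id = v.id
--           ORDER BY a.revision_id DESC) va
--         ON va.id = v.id;
```

The second form is exactly the situation where APPLY is needed, because APPLY is the operator that lets a row source on the right refer to columns from the left.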

Some people have an unnatural bias against correlated subqueries, thinking that the database actually cycles through all the rows. Remember, SQL is a descriptive language. Although the row-by-row description says what the query computes, it is generally not how the engine actually executes it. In particular, correlated subqueries can be the most efficient mechanism under some circumstances.

A more "colloquial" way to write the query would use window functions:

SELECT *
FROM vehicles v JOIN
     (SELECT va.*,
             ROW_NUMBER() OVER (PARTITION BY va.id ORDER BY va.revision_id DESC) as seqnum
      FROM vehicles_aud va
     ) va
     ON va.id = v.id AND
        va.seqnum = 1 JOIN
     revisions r
     ON r.id = va.revision_id;
