
Evaluating a correlated subquery in SQL

I'm having trouble getting my head around how correlated subqueries are evaluated. An example is using a correlated subquery in the SELECT list so that GROUP BY isn't needed.

Consider the relations:

Movies : Title, Director, Length
Schedule : Theater, Title

I have the following query:

SELECT S.Theater, MAX(M.Length)
FROM Movies M JOIN Schedule S ON M.Title=S.Title
GROUP BY S.Theater

Which gets the length of the longest film that each theater is playing. This is the same query without using GROUP BY:

SELECT DISTINCT S.theater,
    (SELECT MAX(M.Length)
    FROM Movies M
    WHERE M.Title=S.Title)
FROM Schedule S

but I don't quite understand how it works.

I'd appreciate it if anybody could give me an example of how correlated subqueries are evaluated.

Thanks :)

From a conceptual standpoint, imagine that the database is going through each row of the result without the subquery:

SELECT DISTINCT S.Theater, S.Title
FROM Schedule S

And then, for each one of those rows, running the subquery for you:

SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title = (whatever S.Title was)

And placing the result in as the column's value. Really, it's not (conceptually) that different from using a function:

SELECT DISTINCT S.Theater, SUBSTRING(S.Title, 1, 5)
FROM Schedule S

It's just that this "function" performs a query against another table instead.

I do say conceptually, though. The database may be optimizing the correlated query into something more like a join. Whatever it does internally matters for performance, but doesn't matter as much for understanding the concept.
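To make the row-by-row picture concrete, here is a minimal sketch that emulates it by hand and compares the outcome with the real correlated query. It uses Python's sqlite3 module as a stand-in for whatever database you are on, and the table contents are made up for illustration; the loop is only the conceptual model, not a claim about what any engine does internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (Title TEXT PRIMARY KEY, Length INTEGER);
    CREATE TABLE Schedule (Title TEXT, Theater TEXT, PRIMARY KEY (Theater, Title));
    INSERT INTO Movies VALUES ('Star Wars', 121), ('Minions', 91), ('Up', 96);
    INSERT INTO Schedule VALUES
        ('Star Wars', 'Cinema 8'), ('Minions', 'Cinema 8'),
        ('Up', 'Cinema 8'), ('Star Wars', 'Cinema 6');
""")

# The correlated query as the database sees it.
correlated = conn.execute("""
    SELECT DISTINCT S.Theater,
           (SELECT MAX(M.Length) FROM Movies M WHERE M.Title = S.Title)
    FROM Schedule S
""").fetchall()

# The conceptual model: walk the outer rows, run the subquery once per row.
outer_rows = conn.execute("SELECT S.Theater, S.Title FROM Schedule S").fetchall()
emulated = set()  # a set plays the role of DISTINCT
for theater, title in outer_rows:
    (max_length,) = conn.execute(
        "SELECT MAX(M.Length) FROM Movies M WHERE M.Title = ?", (title,)
    ).fetchone()
    emulated.add((theater, max_length))

print(sorted(correlated) == sorted(emulated))  # True
```

The bound parameter `?` stands in for "whatever S.Title was" on the current outer row.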

But it may not return the results you're expecting. Consider the following data (sorry, sqlfiddle seems to be erroring at the moment):

CREATE TABLE Movies (
  Title varchar(255),
  Length int(10) unsigned,
  PRIMARY KEY (Title)
);

CREATE TABLE Schedule (
  Title varchar(255),
  Theater varchar(255),
  PRIMARY KEY (Theater, Title)
);

INSERT INTO Movies
VALUES ('Star Wars', 121);
INSERT INTO Movies
VALUES ('Minions', 91);
INSERT INTO Movies
VALUES ('Up', 96);

INSERT INTO Schedule
VALUES ('Star Wars', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Minions', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Up', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Star Wars', 'Cinema 6');

And then this query:

SELECT DISTINCT
  S.Theater,
  (
    SELECT MAX(M.Length)
    FROM Movies M
    WHERE M.Title = S.Title
  ) AS MaxLength
FROM Schedule S;

You'll get this result:

+----------+-----------+
| Theater  | MaxLength |
+----------+-----------+
| Cinema 6 |       121 |
| Cinema 8 |        91 |
| Cinema 8 |       121 |
| Cinema 8 |        96 |
+----------+-----------+

As you can see, it's not a replacement for GROUP BY (and you can still use GROUP BY); it's just running the subquery for each row. Because the subquery is correlated on S.Title, each run returns the length of that one movie, and DISTINCT only removes duplicate rows from the result. It's not giving the "greatest length" per theater anymore, it's just giving each unique movie length associated with the theater name.
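If you do want a correlated subquery that matches the GROUP BY version, one option is to correlate on the theater rather than on the title. The sketch below is my own variation, not something from the question: the inner alias S2 is a name I made up, and sqlite3 is used only so the comparison can be run as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (Title TEXT PRIMARY KEY, Length INTEGER);
    CREATE TABLE Schedule (Title TEXT, Theater TEXT, PRIMARY KEY (Theater, Title));
    INSERT INTO Movies VALUES ('Star Wars', 121), ('Minions', 91), ('Up', 96);
    INSERT INTO Schedule VALUES
        ('Star Wars', 'Cinema 8'), ('Minions', 'Cinema 8'),
        ('Up', 'Cinema 8'), ('Star Wars', 'Cinema 6');
""")

# The original GROUP BY query.
grouped = conn.execute("""
    SELECT S.Theater, MAX(M.Length)
    FROM Movies M JOIN Schedule S ON M.Title = S.Title
    GROUP BY S.Theater
""").fetchall()

# Correlate on Theater: the subquery looks at every movie that the
# outer row's theater is showing, not just the outer row's own title.
per_theater = conn.execute("""
    SELECT DISTINCT S.Theater,
           (SELECT MAX(M.Length)
            FROM Movies M JOIN Schedule S2 ON M.Title = S2.Title
            WHERE S2.Theater = S.Theater) AS MaxLength
    FROM Schedule S
""").fetchall()

print(sorted(grouped))                         # [('Cinema 6', 121), ('Cinema 8', 121)]
print(sorted(per_theater) == sorted(grouped))  # True
```

Here every Schedule row for the same theater produces the same value, so DISTINCT collapses them to one row per theater.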

PS: You would probably want to use an ID column of some sort to identify movies, rather than using the Title in the join. This way, if the name of a movie ever has to be amended, it only needs to change in one place, not all over Schedule too. Plus, joining on an ID number is usually faster than joining on a string.

Conceptually...

To understand this, first ignore the bit about the subquery being correlated.

Consider the order of operations for a statement like this:

SELECT t.foo FROM mytable t

MySQL prepares an empty resultset. Rows in the resultset will consist of one column, because there is one expression in the SELECT list. A row is retrieved from mytable. MySQL puts a row into the resultset, taking the value from the foo column of the mytable row and assigning it to the foo column in the resultset. It fetches the next row and repeats that same process, until there are no more rows to fetch from the table.

Pretty easy stuff. But bear with me.

Consider this statement:

SELECT t.foo AS fi, 'bar' AS fo FROM mytable t

MySQL processes that the same way. Prepare an empty resultset. Rows in the resultset are going to have two columns this time. The first column is given the name fi (because we assigned the name fi with an alias). The second column in rows of the resultset will be named fo, because (again) we assigned an alias.

Now we fetch a row from mytable, and insert a row into the resultset. The value of the foo column goes into the column named fi, and the literal string 'bar' goes into the column named fo. Continue fetching rows and inserting rows into the resultset, until there are no more rows to fetch.

Not too hard.

Next, consider this statement, which looks a little trickier:

SELECT t.foo AS fi, (SELECT 'bar') AS fo FROM mytable t

Same thing happens again. Empty resultset. Rows have two columns, named fi and fo.

Fetch a row from mytable, and prepare a row for the resultset. The value of foo goes into column fi (just like before). This is where it gets tricky... for the second column in the resultset, MySQL executes the query inside the parens. In this case it's a pretty simple query, and we can run it on its own to see what it returns. Take the result from that query, assign it to the fo column, and insert the row into the resultset.

Still with me?

SELECT t.foo AS fi, (SELECT q.tip FROM bartab q LIMIT 1) AS fo FROM mytable t

This is starting to look more complicated. But it's not really that much different. The same things happen again. Prepare the empty resultset. Rows will have two columns, one named fi, the other named fo. Fetch a row from mytable. Get the value from the foo column, and assign it to the fi column in the result row. For the fo column, execute the query, and assign the result from the query to the fo column. Insert the result row into the resultset. Fetch another row from mytable, and repeat the process.

Here we should stop and notice something. MySQL is picky about that query in the SELECT list. Really, really picky. The query must return exactly one column. And it cannot return more than one row.

In that last example, for the row being inserted into the resultset, MySQL is looking for a single value to assign to the fo column. When we think about it that way, it makes sense that the query can't return more than one column... what would MySQL do with the value from the second column? And it makes sense that we don't want to return more than one row... what would MySQL do with multiple rows?

MySQL will allow the query to return zero rows. When that happens, MySQL assigns NULL to the fo column.
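The zero-rows case is easy to check for yourself. This is a small sketch using Python's sqlite3 module as a stand-in for MySQL; the column types and sample values are my own assumptions, since the tables are never defined above. One difference to be aware of: as far as I know, SQLite quietly takes the first row when a scalar subquery yields several rows, whereas MySQL raises an error, so only the NULL behaviour is shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (foo INTEGER);
    CREATE TABLE bartab (col INTEGER, tip INTEGER);
    INSERT INTO mytable VALUES (1), (2);
""")

query = """
    SELECT t.foo AS fi, (SELECT q.tip FROM bartab q LIMIT 1) AS fo
    FROM mytable t
    ORDER BY t.foo
"""

# bartab is empty, so the subquery returns zero rows and fo is NULL (None).
empty_case = conn.execute(query).fetchall()

# With one row in bartab, the subquery returns exactly one value.
conn.execute("INSERT INTO bartab VALUES (1, 10)")
one_row_case = conn.execute(query).fetchall()

print(empty_case)    # [(1, None), (2, None)]
print(one_row_case)  # [(1, 10), (2, 10)]
```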

If you understand that, you're 95% of the way to understanding the correlated subquery.

Let's look at another example. Our single line of SQL is getting a little unwieldy, so we'll add some line breaks and spaces to make it easier to work with. The extra spaces and line breaks don't change the meaning of the statement.

SELECT t.foo AS fi
     , ( SELECT q.tip
           FROM bartab q
          WHERE q.col = t.foo
          ORDER BY q.tip DESC
          LIMIT 1
        ) AS fo
   FROM mytable t

Okay, that looks a lot more complicated. But is it really? It's the same thing again. Prepare an empty resultset. Rows will have two columns, fi and fo. Fetch a row from mytable, and get a row ready to insert into the resultset. Copy the value from the foo column, and assign it to the fi column. And for the fo column, execute the query, assign the single value returned by the query to the fo column, and push the row into the resultset. Fetch the next row from mytable, and repeat.

Now to explain (finally!) the part about "correlated".

The query we are going to run to get the result for the fo column contains a reference to a column from the outer table: t.foo. In this example the reference appears in the WHERE clause; it doesn't have to, it could appear elsewhere in the subquery.

What MySQL does with that, when it runs the subquery, is pass the value of the foo column into the query. If the row we just fetched from mytable has a value of 42 in the foo column... that subquery is equivalent to

         SELECT q.tip
           FROM bartab q
          WHERE q.col =   42
          ORDER BY q.tip DESC
          LIMIT 1

But we're not passing in the literal value 42; what we're passing in is values from the row in the outer query, so the result returned by our subquery is "related" to the row we're processing in the outer query.

Our subquery could be a lot more complicated, as long as we remember the rule about a subquery in the SELECT list... it has to return exactly one column, and at most one row. It returns at most one value.

Correlated subqueries can appear in parts of the statement other than the SELECT list, such as the WHERE clause. The same general concept applies. For each row processed by the outer query, the values of the column(s) from that row are passed in to the subquery. The result returned from the subquery is related to the row being processed in the outer query.
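As an illustration of the WHERE clause case, here is a hedged sketch with EXISTS, run through Python's sqlite3 module. The sample rows and column types are invented for the example; the point is only that the subquery is evaluated against the current outer row's foo value and decides whether that row is kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (foo INTEGER);
    CREATE TABLE bartab (col INTEGER, tip INTEGER);
    INSERT INTO mytable VALUES (41), (42), (43);
    INSERT INTO bartab VALUES (42, 5), (42, 9), (43, 7);
""")

# Keep only the mytable rows that have at least one matching bartab row.
rows = conn.execute("""
    SELECT t.foo
    FROM mytable t
    WHERE EXISTS (SELECT 1 FROM bartab q WHERE q.col = t.foo)
    ORDER BY t.foo
""").fetchall()

print(rows)  # [(42,), (43,)]
```

The row with foo = 41 is dropped because, for that row, the subquery finds nothing in bartab.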


The discussion omits all the steps before the actual execution... parsing the statament into tokens, performing the syntax check (keywords and identifiers in the right place). 讨论省略了实际执行之前的所有步骤...将statament解析为令牌,执行语法检查(关键字和标识符在正确的位置)。 Then performing the semantics check (does mytable exist, does the user have select privilege on it, does the column foo exist in mytable). 然后执行语义检查(mytable是否存在,用户是否具有select权限,mytable中是否存在列foo)。 Then determining the access plan. 然后确定访问计划。 And in the execution, obtaining the required locks, and so on. 并在执行中,获取所需的锁,等等。 All that happens with every statement we execute.) 我们执行的每个语句都会发生这种情况。)

And we're not going to discuss the kinds of horrendous performance issues we can create with correlated subqueries, though the previous discussion should give a clue. Since the subquery is executed for every row we're putting into the resultset (if it's in the SELECT list of our outer query), or for every row that is accessed by the outer query... if the outer query is returning 40,000 rows, that means our correlated subquery is going to be executed 40,000 times. So we had better make sure that subquery executes fast. Even when it executes fast, we're still going to execute it 40,000 times.
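One common way around the per-row execution, sketched here under my own assumptions, is to compute the per-group value once in a derived table and join to it. The alias m, the column name max_tip and the sample rows are all made up, and sqlite3 is only a stand-in; whether this is actually faster depends on your data, indexes and optimizer, so treat it as something to measure rather than a rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (foo INTEGER);
    CREATE TABLE bartab (col INTEGER, tip INTEGER);
    INSERT INTO mytable VALUES (41), (42), (43);
    INSERT INTO bartab VALUES (42, 5), (42, 9), (43, 7);
""")

# Correlated form: the subquery is (conceptually) run once per mytable row.
correlated = conn.execute("""
    SELECT t.foo AS fi,
           (SELECT q.tip FROM bartab q
            WHERE q.col = t.foo
            ORDER BY q.tip DESC LIMIT 1) AS fo
    FROM mytable t
    ORDER BY t.foo
""").fetchall()

# Join form: aggregate bartab once, then match rows by key.
# LEFT JOIN keeps mytable rows with no match, giving NULL like the subquery does.
joined = conn.execute("""
    SELECT t.foo AS fi, m.max_tip AS fo
    FROM mytable t
    LEFT JOIN (SELECT col, MAX(tip) AS max_tip FROM bartab GROUP BY col) m
      ON m.col = t.foo
    ORDER BY t.foo
""").fetchall()

print(correlated)            # [(41, None), (42, 9), (43, 7)]
print(correlated == joined)  # True
```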
