
SQL Server IN vs. EXISTS Performance

I'm curious which of the following would be more efficient.

I've always been a bit cautious about using IN because I believe SQL Server turns the result set into a big IF statement. For a large result set, this could result in poor performance. For small result sets, I'm not sure either is preferable. For large result sets, wouldn't EXISTS be more efficient?

WHERE EXISTS (SELECT * FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)

vs.

WHERE bx.BoxID IN (SELECT BoxID FROM Base WHERE [Rank] = 2)

EXISTS will be faster because once the engine has found a hit, it will quit looking, as the condition has been proved true.

With IN, it will collect all the results from the sub-query before further processing.

The accepted answer is shortsighted, and the question is a bit loose, in that:

1) Neither explicitly mentions whether a covering index is present on the left side, the right side, or both.

2) Neither takes into account the size of the left-side input set and the right-side input set.
(The question just mentions an overall large result set.)

I believe the optimizer is smart enough to convert between "in" and "exists" when there is a significant cost difference due to (1) and (2); otherwise the choice may just be used as a hint (e.g. EXISTS to encourage use of a seekable index on the right side).

Both forms can be converted to join forms internally, have the join order reversed, and run as loop, hash, or merge joins, based on the estimated row counts (left and right) and on whether indexes exist on the left side, the right side, or both.

I've done some testing on SQL Server 2005 and 2008, and on both, EXISTS and IN come back with the exact same actual execution plan, as others have stated. The Optimizer is optimal. :)

Something to be aware of, though: EXISTS, IN, and JOIN can sometimes return different results if you don't phrase your query just right: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
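One well-known case where the forms diverge is NOT IN against a subquery that can return NULL. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for SQL Server (SQLite follows the same three-valued logic for this case), and the sample rows in bx and Base are made up.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server.
# The sample data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bx   (BoxID INTEGER);
    CREATE TABLE Base (BoxID INTEGER);
    INSERT INTO bx   VALUES (1), (2), (3);
    INSERT INTO Base VALUES (1), (NULL);
""")

# NOT IN: the NULL in Base makes the predicate UNKNOWN for every
# non-matching row, so no rows qualify.
not_in = conn.execute(
    "SELECT BoxID FROM bx "
    "WHERE BoxID NOT IN (SELECT BoxID FROM Base) ORDER BY BoxID"
).fetchall()

# NOT EXISTS: only asks whether a matching row exists, so the NULL
# row in Base never matches and does no harm.
not_exists = conn.execute(
    "SELECT BoxID FROM bx WHERE NOT EXISTS "
    "(SELECT 1 FROM Base WHERE Base.BoxID = bx.BoxID) ORDER BY BoxID"
).fetchall()

print(not_in)      # []
print(not_exists)  # [(2,), (3,)]
```

If the subquery column is declared NOT NULL, or the NULLs are filtered out inside the subquery, the two forms return the same rows again.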

I'd go with EXISTS over IN; see the link below:

SQL Server: JOIN vs IN vs EXISTS - the logical difference

There is a common misconception that IN behaves the same as EXISTS or JOIN in terms of returned results. This is simply not true.

IN: Returns true if a specified value matches any value in a subquery or a list.

Exists: Returns true if a subquery contains any rows.

Join: Joins 2 resultsets on the joining column.

Blog credit: https://stackoverflow.com/users/31345/mladen-prajdic

There are many misleading answers here, including the highly upvoted one (although I don't believe their authors meant harm). The short answer is: these are the same.

There are many keywords in the (T-)SQL language, but in the end, the only thing that really happens on the hardware is the set of operations seen in the query execution plan.

The relational (maths theory) operation we do when we invoke [NOT] IN and [NOT] EXISTS is the semi join (anti-join when using NOT). It is not a coincidence that the corresponding sql-server operations have the same name. There is no operation that mentions IN or EXISTS anywhere, only (anti-)semi joins. Thus, there is no way that a logically equivalent IN vs EXISTS choice could affect performance, because there is one and only one way, the (anti-)semi join execution operation, to get their results.
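To make the semi join idea concrete, here is a conceptual model in plain Python. It is a sketch of the relational operation only, not of SQL Server's internals: the function names and sample rows are hypothetical, and NULL handling (which matters for NOT IN) is deliberately left out.

```python
def semi_join_loop(outer, inner, match):
    """Nested-loop semi join: any() stops scanning inner at the first match."""
    return [o for o in outer if any(match(o, i) for i in inner)]


def semi_join_hash(outer, inner, outer_key, inner_key):
    """Hash semi join: build a set of inner keys once, then probe it."""
    keys = {inner_key(i) for i in inner}
    return [o for o in outer if outer_key(o) in keys]


def anti_semi_join_hash(outer, inner, outer_key, inner_key):
    """Anti-semi join: keep the outer rows that have no match in inner."""
    keys = {inner_key(i) for i in inner}
    return [o for o in outer if outer_key(o) not in keys]


# Hypothetical rows; BoxID 1 appears twice in base on purpose.
boxes = [{"BoxID": 1}, {"BoxID": 2}, {"BoxID": 3}]
base = [
    {"BoxID": 1, "Rank": 2},
    {"BoxID": 1, "Rank": 2},
    {"BoxID": 3, "Rank": 1},
]
ranked = [b for b in base if b["Rank"] == 2]

loop_result = semi_join_loop(
    boxes, ranked, lambda o, i: o["BoxID"] == i["BoxID"]
)
hash_result = semi_join_hash(
    boxes, ranked, lambda o: o["BoxID"], lambda i: i["BoxID"]
)
anti_result = anti_semi_join_hash(
    boxes, ranked, lambda o: o["BoxID"], lambda i: i["BoxID"]
)

print(loop_result)  # [{'BoxID': 1}]
print(hash_result)  # [{'BoxID': 1}]
print(anti_result)  # [{'BoxID': 2}, {'BoxID': 3}]
```

Both strategies return each outer row at most once, however many inner rows match. That is the property IN and EXISTS share, and it is independent of which keyword was written in the query.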

An example:

Query 1 (plan)

select * from dt where dt.customer in (select c.code from customer c where c.active=0)

Query 2 (plan)

select * from dt where exists (select 1 from customer c where c.code=dt.customer and c.active=0)
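The logical equivalence of the two queries can be checked on a toy data set. This is a minimal sketch that assumes made-up column types and rows for dt and customer, and it runs on SQLite through Python's sqlite3 module rather than on SQL Server. It therefore says nothing about SQL Server's plans; it only shows that both forms select the same rows.

```python
import sqlite3

# Hypothetical schema and rows; SQLite is a stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (code TEXT PRIMARY KEY, active INTEGER);
    CREATE TABLE dt (id INTEGER PRIMARY KEY, customer TEXT);
    INSERT INTO customer VALUES ('A', 0), ('B', 1), ('C', 0);
    INSERT INTO dt VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'A');
""")

# Query 1: uncorrelated IN.
in_rows = conn.execute("""
    SELECT * FROM dt
    WHERE dt.customer IN (SELECT c.code FROM customer c WHERE c.active = 0)
    ORDER BY dt.id
""").fetchall()

# Query 2: correlated EXISTS.
exists_rows = conn.execute("""
    SELECT * FROM dt
    WHERE EXISTS (SELECT 1 FROM customer c
                  WHERE c.code = dt.customer AND c.active = 0)
    ORDER BY dt.id
""").fetchall()

print(in_rows == exists_rows)  # True
print(in_rows)                 # [(1, 'A'), (3, 'C'), (4, 'A')]
```

To compare performance on SQL Server itself, run both queries there and look at the actual execution plans.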

In these cases the execution plans are usually identical, but you never really know until you see how the optimizer factors in all the other aspects, such as indexes.

So, IN is not the same as EXISTS, nor will it produce the same execution plan.

Usually EXISTS is used in a correlated subquery, which means you will JOIN the EXISTS inner query with your outer query. That adds more steps to produce a result, as you need to resolve the outer query joins and the inner query joins, then match their WHERE clauses to join both.

Usually IN is used without correlating the inner query with the outer query, and that can be resolved in only one step (in the best-case scenario).

Consider this:

  1. If you use IN and the inner query result is millions of rows of distinct values, it will probably perform SLOWER than EXISTS, given that the EXISTS query is performant (has the right indexes to join with the outer query).

  2. If you use EXISTS and the join with your outer query is complex (takes more time to perform, no suitable indexes), it will slow the query in proportion to the number of rows in the outer table; sometimes the estimated time to complete can be in days. If the number of rows is acceptable for your given hardware, or the cardinality of the data is favorable (for example, fewer DISTINCT values in a large data set), IN can perform faster than EXISTS.

  3. All of the above becomes noticeable when you have a fair number of rows in each table (by fair I mean something that exceeds your CPU processing and/or RAM thresholds for caching).

So the ANSWER is: it DEPENDS. You can write a complex query inside IN or EXISTS, but as a rule of thumb, you should try to use IN with a limited set of distinct values, and EXISTS when you have a lot of rows with a lot of distinct values.

The trick is to limit the number of rows to be scanned.

Regards,

MarianoC

To optimize the EXISTS, be very literal; something just has to be there, but you don't actually need any data returned from the correlated sub-query. You're just evaluating a Boolean condition.

So:

WHERE EXISTS (SELECT TOP 1 1 FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)

Because the correlated sub-query is RBAR (row by agonizing row), the first result hit makes the condition true, and it is processed no further.

I know that this is a very old question, but I think my answer would add some tips.

I just came across a blog post on mssqltips, "sql exists vs in vs join", and it turns out that they are generally the same performance-wise.

But the downsides of one vs. the other are as follows:

  1. The in statement has a downside: it can only compare the two tables on one column.

  2. The join statement will run on duplicate values, while in and exists will ignore duplicates.
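Both downsides can be seen on a small data set. The following is a sketch only: the orders and lookups tables and their rows are hypothetical, and it runs on SQLite through Python's sqlite3 module as a stand-in for SQL Server.

```python
import sqlite3

# Hypothetical tables; SQLite is a stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders  (order_id INTEGER, region TEXT, code TEXT);
    CREATE TABLE lookups (region TEXT, code TEXT);
    INSERT INTO orders  VALUES (1, 'EU', 'A'), (2, 'US', 'A');
    -- ('EU', 'A') appears twice on purpose.
    INSERT INTO lookups VALUES ('EU', 'A'), ('EU', 'A'), ('US', 'B');
""")

# JOIN: every matching lookup row produces an output row,
# so order 1 comes back twice.
join_rows = conn.execute("""
    SELECT o.order_id FROM orders o
    JOIN lookups l ON l.region = o.region AND l.code = o.code
    ORDER BY o.order_id
""").fetchall()

# EXISTS: each order is returned at most once, and the match
# can be on two columns.
exists_rows = conn.execute("""
    SELECT o.order_id FROM orders o
    WHERE EXISTS (SELECT 1 FROM lookups l
                  WHERE l.region = o.region AND l.code = o.code)
    ORDER BY o.order_id
""").fetchall()

# IN on a single column: matching on code alone also lets order 2
# through, although no ('US', 'A') pair exists in lookups.
in_rows = conn.execute("""
    SELECT o.order_id FROM orders o
    WHERE o.code IN (SELECT l.code FROM lookups l)
    ORDER BY o.order_id
""").fetchall()

print(join_rows)    # [(1,), (1,)]
print(exists_rows)  # [(1,)]
print(in_rows)      # [(1,), (2,)]
```

With a join, the duplicates have to be removed explicitly (for example with DISTINCT) to get the same rows as the EXISTS form.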

But when you look at the execution time, there is no big difference.

The interesting thing is that when you create an index on the table, the execution of the join is better.

And I think that join has another upside: it's easier to write and understand, especially for newcomers.

Off the top of my head and not guaranteed to be correct: I believe the second will be faster in this case.

  1. In the first, the correlated subquery will likely cause the subquery to be run for each row.
  2. In the second example, the subquery should only run once, since it is not correlated.
  3. In the second example, the IN will short-circuit as soon as it finds a match.
