为什么在SQL查询中IN IN比IN慢得多

Question

I have found a surprising (at least to me) thing with IN and NOT IN. 我发现IN和NOT IN令人惊讶（至少对我而言）。 When I try to explain first query on PostgreSQL database: 当我尝试解释PostgreSQL数据库上的第一个查询时：

EXPLAIN DELETE
FROM AuditTaskImpl l
WHERE  l.processInstanceId in (select spl.processInstanceId
                               FROM ProcessInstanceLog spl
                               WHERE spl.status not in ( 2, 3))

it tells me this: 它告诉我这个：

Delete on audittaskimpl l  (cost=2794.48..6373.52 rows=50859 width=12)
  ->  Hash Semi Join  (cost=2794.48..6373.52 rows=50859 width=12)
        Hash Cond: (l.processinstanceid = spl.processinstanceid)
        ->  Seq Scan on audittaskimpl l  (cost=0.00..2005.59 rows=50859 width=14)
        ->  Hash  (cost=1909.24..1909.24 rows=50899 width=14)
              ->  Seq Scan on processinstancelog spl  (cost=0.00..1909.24 rows=50899 width=14)
                    Filter: (status <> ALL ('{2,3}'::integer[]))

However, when I change in for not in, which is just a negation: 但是，当我换入in not时，这只是一个否定：

EXPLAIN DELETE
FROM AuditTaskImpl l
WHERE  l.processInstanceId NOT in (select spl.processInstanceId
                               FROM ProcessInstanceLog spl
                               WHERE spl.status not in ( 2, 3))

it tells me this: 它告诉我这个：

Delete on audittaskimpl l  (cost=0.00..63321079.15 rows=25430 width=6)
  ->  Seq Scan on audittaskimpl l  (cost=0.00..63321079.15 rows=25430 width=6)
        Filter: (NOT (SubPlan 1))
        SubPlan 1
          ->  Materialize  (cost=0.00..2362.73 rows=50899 width=8)
                ->  Seq Scan on processinstancelog spl  (cost=0.00..1909.24 rows=50899 width=8)
                      Filter: (status <> ALL ('{2,3}'::integer[]))

As you can see, with IN it uses hash join, which is of course much quicker, but with NOT IN it uses just plain sequential scan row by row. 正如您所看到的，使用IN它使用散列连接，这当然要快得多，但是使用NOT IN它只是逐行使用简单的顺序扫描。 But since NOT IN is just a negation it could use hash join again and just do the opposite: with IN when there is processInstanceId in nested select, add it to the result, when there is not, don't add it, with NOT IN when there is processInstanceId in nested select, don't add it to the result, when there is not, add it to the result. 但是因为NOT IN只是一个否定它可以再次使用散列连接而只是反过来：当嵌套选择中有processInstanceId时使用IN，将其添加到结果中，当没有时，不添加它，使用NOT IN当嵌套选择中有processInstanceId时，不要将其添加到结果中，如果没有，则将其添加到结果中。

So can you perhaps explain why this happens? 你能解释一下为什么会这样吗？ To clarify AuditTaskImpl has processInstanceId attribute which is also present in ProcessInstanceLog table, although there is no foreign key relationship between them. 澄清AuditTaskImpl的processInstanceId属性也存在于ProcessInstanceLog表中，尽管它们之间没有外键关系。

Thanks. 谢谢。

Answer 1

The semantics of NOT IN require that nothing be returned if any value in the subquery is NULL . 如果子查询中的任何值为NULL ，则NOT IN的语义要求不返回任何内容 。 Hence, Postgres needs to look at all the values. 因此，Postgres需要查看所有值。

I strongly recommend never using NOT IN with a subquery. 我强烈建议永远NOT IN子查询中使用NOT IN 。 Always use NOT EXISTS : 始终使用NOT EXISTS ：

DELETE FROM AuditTaskImpl l
    WHERE NOT EXISTS (SELECT 1 
                      FROM ProcessInstanceLog spl
                      WHERE l.processInstanceId = spl.spl.processInstanceId AND
                            spl.status not in (2, 3)
                     );

为什么在SQL查询中IN IN比IN慢得多

问题描述

1 个解决方案

解决方案1
3 已采纳 2018-04-06 21:32:14

为什么在SQL查询中IN IN比IN慢得多

问题描述

1 个解决方案

解决方案1 3 已采纳 2018-04-06 21:32:14

解决方案1
3 已采纳 2018-04-06 21:32:14