Why does adding ORDER BY drastically speed up a query?

Question

I've discovered some very strange and counter-intuitive behaviour in PostgreSQL.

I have a query structured as follows. I am selecting both the IDs and the count from a subquery. The subquery does the filtering, joining, and counting, but only orders by the IDs, since I'm using DISTINCT ON() to get only unique IDs.

The outer query then does the proper ordering, and applies any limits and offsets as needed. Here is an example of what the query structure looks like:

SELECT s.id, s.item_count
FROM (
    SELECT DISTINCT ON (work_items.id) work_items.id
        , work_item_states.disposition AS disposition
        , COUNT(work_items.id) OVER () AS item_count
    FROM work_items
    JOIN work_item_states ON work_item_states.work_item_refer = work_items.id
    WHERE work_item_states.disposition = 'cancelled'
    ORDER BY work_items.id
) AS s
ORDER BY s.disposition
LIMIT 50
OFFSET 0

I've discovered something strange, however. My database has several million entries, so queries aren't the fastest overall. But when I remove the ORDER BY clause from the outer query, the query becomes drastically slower.

However, if I also remove the LIMIT clause, it becomes fast again, despite the fact that this example query returns 800 000+ results.

In summary, for the outer query:

ORDER BY AND LIMIT - Fast

...
) AS s
ORDER BY s.disposition
LIMIT 50
OFFSET 0

ONLY LIMIT - Very slow

...
) AS s
LIMIT 50
OFFSET 0

ONLY ORDER BY - Fast, despite 800 000 results

...
) AS s
ORDER BY s.disposition
OFFSET 0

NEITHER - Fast, despite 800 000 results

...
) AS s
OFFSET 0

To give an idea of how much slower the LIMIT-only variant is: with both clauses, with neither, or with just ORDER BY, the queries take no more than about 10 seconds.

With only the LIMIT clause, however, the queries take about 1 minute 15 seconds, over 7 times as long!

You'd think that ORDER BY would slow things down instead, as it has to sort the results of the subquery, but that doesn't seem to be the case. It's very counter-intuitive.

If someone knows what's going on behind the scenes here, I'd greatly appreciate them shedding some light on this.

Thanks

EDIT - Added execution plans for the statements:

ORDER BY and LIMIT execution plan

Limit  (cost=518486.52..518486.65 rows=50 width=53)
  ->  Sort  (cost=518486.52..520495.59 rows=803628 width=53)
        Sort Key: s.disposition
        ->  Subquery Scan on s  (cost=479736.16..491790.58 rows=803628 width=53)
              ->  Unique  (cost=479736.16..483754.30 rows=803628 width=53)
                    ->  Sort  (cost=479736.16..481745.23 rows=803628 width=53)
                          Sort Key: work_items.id
                          ->  WindowAgg  (cost=136262.98..345979.65 rows=803628 width=53)
                                ->  Hash Join  (cost=136262.98..335934.30 rows=803628 width=45)
                                      Hash Cond: (work_items.id = work_item_states.work_item_refer)
                                      ->  Seq Scan on work_items  (cost=0.00..106679.48 rows=4020148 width=37)
                                      ->  Hash  (cost=119152.97..119152.97 rows=803681 width=45)
                                            ->  Bitmap Heap Scan on work_item_states  (cost=18968.96..119152.97 rows=803681 width=45)
                                                  Recheck Cond: (disposition = 'cancelled'::text)
                                                  ->  Bitmap Index Scan on idx_work_item_states_disposition  (cost=0.00..18768.04 rows=803681 width=0)
                                                        Index Cond: (disposition = 'cancelled'::text)

Only LIMIT execution plan

Limit  (cost=1.11..69.52 rows=50 width=45)
  ->  Subquery Scan on s  (cost=1.11..1099599.17 rows=803628 width=45)
        ->  Unique  (cost=1.11..1091562.89 rows=803628 width=77)
              ->  WindowAgg  (cost=1.11..1089553.82 rows=803628 width=77)
                    ->  Merge Join  (cost=1.11..1079508.47 rows=803628 width=37)
                          Merge Cond: (work_items.id = work_item_states.work_item_refer)
                          ->  Index Only Scan using idx_work_items_id on work_items  (cost=0.56..477365.14 rows=4020148 width=37)
                          ->  Index Scan using idx_work_item_states_work_item_refer on work_item_states  (cost=0.56..582047.48 rows=803681 width=37)
                                Filter: (disposition = 'cancelled'::text)

Only ORDER BY execution plan

Sort  (cost=625547.09..627556.16 rows=803628 width=53)
  Sort Key: s.disposition
  ->  Subquery Scan on s  (cost=479736.16..491790.58 rows=803628 width=53)
        ->  Unique  (cost=479736.16..483754.30 rows=803628 width=53)
              ->  Sort  (cost=479736.16..481745.23 rows=803628 width=53)
                    Sort Key: work_items.id
                    ->  WindowAgg  (cost=136262.98..345979.65 rows=803628 width=53)
                          ->  Hash Join  (cost=136262.98..335934.30 rows=803628 width=45)
                                Hash Cond: (work_items.id = work_item_states.work_item_refer)
                                ->  Seq Scan on work_items  (cost=0.00..106679.48 rows=4020148 width=37)
                                ->  Hash  (cost=119152.97..119152.97 rows=803681 width=45)
                                      ->  Bitmap Heap Scan on work_item_states  (cost=18968.96..119152.97 rows=803681 width=45)
                                            Recheck Cond: (disposition = 'cancelled'::text)
                                            ->  Bitmap Index Scan on idx_work_item_states_disposition  (cost=0.00..18768.04 rows=803681 width=0)
                                                  Index Cond: (disposition = 'cancelled'::text)

Answer 1

You didn't post your execution plans, but I have my crystal ball ready, so I'll hazard a guess at what's going on.

In your second, very slow query the optimizer has a bright idea for making it fast: it scans work_items using the index on id, fetches all the matching rows from work_item_states in a nested loop, and filters out everything that does not match work_item_states.disposition = 'cancelled' until it has found 50 distinct results.
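The estimate behind that idea can be reconstructed from the LIMIT-only plan posted in the question (it shows a Merge Join over two index scans rather than a nested loop, but the reasoning is the same). As a rough sketch, assuming the planner prices a Limit node by linear interpolation between its child's startup cost and total cost, the posted numbers work out like this:

```sql
-- Rough reconstruction of the two Limit estimates; every constant is copied
-- from the execution plans posted in the question.
-- Assumption: limit cost = child startup
--                        + (child total - child startup) * limit rows / child rows
SELECT round(1.11 + (1099599.17 - 1.11) * 50 / 803628.0, 2)
           AS limit_only_estimate,        -- 69.52, matches "Limit (cost=1.11..69.52"
       round(518486.52 + (520495.59 - 518486.52) * 50 / 803628.0, 2)
           AS order_by_limit_estimate;    -- 518486.65, matches the ORDER BY and LIMIT plan
```

On paper the LIMIT-only plan therefore looks several thousand times cheaper, because the planner expects to stop after reading a tiny fraction of the join. That expectation only holds if qualifying rows turn up early and evenly in id order.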

This is a good idea, but the optimizer does not know that all the rows with work_item_states.disposition = 'cancelled' match work_items with a high id, so it has to scan forever until it has found its 50 rows.
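Whether the cancelled rows really sit at the high end of the id range is something that can be checked directly. The query below is only a sketch: the table and column names come from the question, the output column aliases are made up for illustration, and it assumes the id columns have a type that supports min and max (text and integer types do).

```sql
-- Compare the id range of the 'cancelled' rows with the id range of the
-- whole work_items table. Output aliases are illustrative only.
SELECT (SELECT min(id) FROM work_items) AS lowest_id,
       (SELECT max(id) FROM work_items) AS highest_id,
       min(work_item_refer)             AS lowest_cancelled_id,
       max(work_item_refer)             AS highest_cancelled_id,
       count(*)                         AS cancelled_rows
FROM work_item_states
WHERE disposition = 'cancelled';
```

If lowest_cancelled_id is close to highest_id, an index scan in id order has to wade through most of work_items before it meets the first qualifying row.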

All the other queries don't allow the planner to choose that strategy, because it only pays off when a few rows in work_items.id order are enough.
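One way to check this guess, and one possible way around the bad plan, is sketched below. The first statement runs the slow variant under EXPLAIN (ANALYZE, BUFFERS) so that estimated and actual row counts can be compared; since COUNT(...) OVER () has to read its whole input before it can return a first row, it is worth looking at how many rows the join really produced. The second statement is an assumption, not something from the question: it moves the subquery into a materialized CTE, which should be planned without regard to the outer LIMIT. AS MATERIALIZED needs PostgreSQL 12 or later; on older versions a plain CTE already acts as an optimization fence, so the keyword would simply be dropped there.

```sql
-- 1. Diagnose: show estimated vs. actual rows and buffer usage for the
--    slow LIMIT-only variant. Note that ANALYZE really executes the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT s.id, s.item_count
FROM (
    SELECT DISTINCT ON (work_items.id) work_items.id
        , work_item_states.disposition AS disposition
        , COUNT(work_items.id) OVER () AS item_count
    FROM work_items
    JOIN work_item_states ON work_item_states.work_item_refer = work_items.id
    WHERE work_item_states.disposition = 'cancelled'
    ORDER BY work_items.id
) AS s
LIMIT 50
OFFSET 0;

-- 2. Possible workaround (assumption, PostgreSQL 12+): materialize the
--    subquery so the outer LIMIT cannot steer how it is planned.
WITH s AS MATERIALIZED (
    SELECT DISTINCT ON (work_items.id) work_items.id
        , work_item_states.disposition AS disposition
        , COUNT(work_items.id) OVER () AS item_count
    FROM work_items
    JOIN work_item_states ON work_item_states.work_item_refer = work_items.id
    WHERE work_item_states.disposition = 'cancelled'
    ORDER BY work_items.id
)
SELECT s.id, s.item_count
FROM s
LIMIT 50
OFFSET 0;
```

Prefixing the second statement with EXPLAIN (ANALYZE, BUFFERS) as well would show whether it ends up with the same hash join and sort as the fast variants.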

Answer 1 (4, accepted, 2018-12-12 19:20:53)
