为什么这个DISTINCT / INNER JOIN / ORDER BY postgresql查询这么慢？

Question

This query takes ~4 seconds to complete: 此查询需要约4秒才能完成：

SELECT DISTINCT "resources_resource"."id",
                  "resources_resource"."heading",
                  "resources_resource"."name",
                  "resources_resource"."old_name",
                  "resources_resource"."clean_name",
                  "resources_resource"."sort_name",
                  "resources_resource"."see_also_id",
                  "resources_resource"."referenced_passages",
                  "resources_resource"."resource_type",
                  "resources_resource"."ord",
                  "resources_resource"."content",
                  "resources_resource"."thumb",
                  "resources_resource"."resource_origin"
  FROM "resources_resource"
  INNER JOIN "resources_passageresource" ON ("resources_resource"."id" = "resources_passageresource"."resource_id")
  WHERE "resources_passageresource"."start_ref" >= 66001001
  ORDER BY "resources_resource"."ord" ASC, "resources_resource"."sort_name" ASC LIMIT 5

By popular request, EXPLAIN ANALYZE: 按流行要求，EXPLAIN ANALYZE：

Limit  (cost=1125.50..1125.68 rows=5 width=803) (actual time=4434.076..4434.557 rows=5 loops=1)
  ->  Unique  (cost=1125.50..1136.91 rows=326 width=803) (actual time=4434.076..4434.557 rows=5 loops=1)
        ->  Sort  (cost=1125.50..1126.32 rows=326 width=803) (actual time=4434.075..4434.075 rows=6 loops=1)
              Sort Key: resources_resource.ord, resources_resource.sort_name, resources_resource.id, resources_resource.heading, resources_resource.name, resources_resource.old_name, resources_resource.clean_name, resources_resource.see_also_id, resources_resource.referenced_passages, resources_resource.resource_type, resources_resource.content, resources_resource.thumb, resources_resource.resource_origin
              Sort Method:  quicksort  Memory: 424kB
              ->  Hash Join  (cost=697.00..1111.89 rows=326 width=803) (actual time=3.453..41.429 rows=424 loops=1)
                    Hash Cond: (resources_passageresource.resource_id = resources_resource.id)
                    ->  Bitmap Heap Scan on resources_passageresource  (cost=10.78..190.19 rows=326 width=4) (actual time=0.107..0.401 rows=424 loops=1)
                          Recheck Cond: (start_ref >= 66001001)
                          ->  Bitmap Index Scan on resources_passageresource_start_ref  (cost=0.00..10.70 rows=326 width=0) (actual time=0.086..0.086 rows=424 loops=1)
                                Index Cond: (start_ref >= 66001001)
                    ->  Hash  (cost=431.32..431.32 rows=2232 width=803) (actual time=3.228..3.228 rows=2232 loops=1)
                          Buckets: 1024  Batches: 2  Memory Usage: 947kB
                          ->  Seq Scan on resources_resource  (cost=0.00..431.32 rows=2232 width=803) (actual time=0.002..1.621 rows=2232 loops=1)
Total runtime: 4435.460 ms

This is ORM-generated SQL. 这是ORM生成的SQL。 I can work in SQL, but I'm definitely not proficient, and the EXPLAIN output here is mystifying to me. 我可以在SQL中工作，但我绝对不会精通，这里的EXPLAIN输出对我来说很神秘。 What about this query is dragging me down? 这个查询怎么样把我拉下来？

UPDATE: @Ybakos identified that the ORDER_BY was causing trouble. 更新： @Ybakos发现ORDER_BY导致了问题。 Removing the ORDER_BY clause altogether helps a bit, but the query still takes 800ms. 完全删除ORDER_BY子句有点帮助，但查询仍然需要800毫秒。 Here's the EXPLAIN ANALYZE, sans ORDER_BY : 这是EXPLAIN ANALYZE，没有ORDER_BY ：

HashAggregate  (cost=1122.49..1125.75 rows=326 width=803) (actual time=787.519..787.559 rows=104 loops=1)
  ->  Hash Join  (cost=697.00..1111.89 rows=326 width=803) (actual time=3.381..7.312 rows=424 loops=1)
        Hash Cond: (resources_passageresource.resource_id = resources_resource.id)
        ->  Bitmap Heap Scan on resources_passageresource  (cost=10.78..190.19 rows=326 width=4) (actual time=0.095..0.686 rows=424 loops=1)
              Recheck Cond: (start_ref >= 66001001)
              ->  Bitmap Index Scan on resources_passageresource_start_ref  (cost=0.00..10.70 rows=326 width=0) (actual time=0.079..0.079 rows=424 loops=1)
                    Index Cond: (start_ref >= 66001001)
        ->  Hash  (cost=431.32..431.32 rows=2232 width=803) (actual time=3.173..3.173 rows=2232 loops=1)
              Buckets: 1024  Batches: 2  Memory Usage: 947kB
              ->  Seq Scan on resources_resource  (cost=0.00..431.32 rows=2232 width=803) (actual time=0.002..1.568 rows=2232 loops=1)
Total runtime: 787.678 ms

Answer 1

It seems to me, DISTINCT has to be used to remove duplicates produced by the join. 在我看来， DISTINCT必须用于删除连接产生的重复项。 So my question is, why produce the duplicates in the first place? 所以我的问题是，为什么要首先制作副本？ I'm not entirely sure what this query's being ORM-generated must imply, but if rewriting it is an option, you could certainly rewrite it in such a way as to prevent duplicates from appearing. 我不完全确定这个ORM生成的查询必须暗示什么，但如果重写它是一个选项，你当然可以重写它以防止重复出现。 For instance, using IN : 例如，使用IN ：

SELECT "resources_resource"."id",
       "resources_resource"."heading",
       "resources_resource"."name",
       "resources_resource"."old_name",
       "resources_resource"."clean_name",
       "resources_resource"."sort_name",
       "resources_resource"."see_also_id",
       "resources_resource"."referenced_passages",
       "resources_resource"."resource_type",
       "resources_resource"."ord",
       "resources_resource"."content",
       "resources_resource"."thumb",
       "resources_resource"."resource_origin"
  FROM "resources_resource"
  WHERE "resources_resource"."id" IN (
        SELECT "resources_passageresource"."resource_id"
          FROM "resources_passageresource"
          WHERE "resources_passageresource"."start_ref" >= 66001001
        )
  ORDER BY "resources_resource"."ord" ASC, "resources_resource"."sort_name" ASC LIMIT 5

or using EXISTS : 或使用EXISTS ：

SELECT "resources_resource"."id",
       "resources_resource"."heading",
       "resources_resource"."name",
       "resources_resource"."old_name",
       "resources_resource"."clean_name",
       "resources_resource"."sort_name",
       "resources_resource"."see_also_id",
       "resources_resource"."referenced_passages",
       "resources_resource"."resource_type",
       "resources_resource"."ord",
       "resources_resource"."content",
       "resources_resource"."thumb",
       "resources_resource"."resource_origin"
  FROM "resources_resource"
  WHERE EXISTS (
        SELECT *
          FROM "resources_passageresource"
          WHERE "resources_passageresource"."resource_id" = "resources_resource"."id"
            AND "resources_passageresource"."start_ref" >= 66001001
        )
  ORDER BY "resources_resource"."ord" ASC, "resources_resource"."sort_name" ASC LIMIT 5

And, of course, if it's acceptable to rewrite the query completely, I would also remove the long table names in front of column names. 当然，如果完全重写查询是可以接受的，我还会删除列名前面的长表名。 Consider the following, for instance (the IN query rewritten): 例如，请考虑以下内容（重写IN查询）：

SELECT "id",
       "heading",
       "name",
       "old_name",
       "clean_name",
       "sort_name",
       "see_also_id",
       "referenced_passages",
       "resource_type",
       "ord",
       "content",
       "thumb",
       "resource_origin"
  FROM "resources_resource"
  WHERE "resources_resource"."id" IN (
        SELECT "resource_id"
          FROM "resources_passageresource"
          WHERE "start_ref" >= 66001001
        )
  ORDER BY "ord" ASC, "sort_name" ASC LIMIT 5

Answer 2

It's the combination of ORDER BY with LIMIT. 它是ORDER BY和LIMIT的组合。

If you don't have an index on (ord, sort_name) then I bet this is the cause of the slow performance. 如果你没有（ord，sort_name）索引，那么我敢打赌这是导致性能下降的原因。 Or perhaps an index on (start_ref, ord, sort_name) is necessary for this particular query. 或者，此特定查询可能需要（start_ref，ord，sort_name）上的索引。 Lastly, due to that join, perhaps have the left/first table be the one upon which your ORDER BY criteria applies. 最后，由于该连接，可能左/第一个表是您的ORDER BY条件适用的表。

Answer 3

That seems like a long time in the JOIN. 这在JOIN中似乎很长一段时间。 The default memory settings in postgresql.conf are too low for any modern computer. postgresql.conf中的默认内存设置对于任何现代计算机来说都太低了。 Have you remembered to bump them up? 你有没有记得碰它们？

为什么这个DISTINCT / INNER JOIN / ORDER BY postgresql查询这么慢？

问题描述

3 个解决方案

解决方案1
2 已采纳 2012-03-29 11:08:26

解决方案2
1 2012-03-29 00:12:30

解决方案3
1 2012-03-29 03:35:52

为什么这个DISTINCT / INNER JOIN / ORDER BY postgresql查询这么慢？

问题描述

3 个解决方案

解决方案1 2 已采纳 2012-03-29 11:08:26

解决方案2 1 2012-03-29 00:12:30

解决方案3 1 2012-03-29 03:35:52

解决方案1
2 已采纳 2012-03-29 11:08:26

解决方案2
1 2012-03-29 00:12:30

解决方案3
1 2012-03-29 03:35:52