[英]Why is this DISTINCT/INNER JOIN/ORDER BY postgresql query so slow?
This query takes ~4 seconds to complete: 此查询需要约4秒才能完成:
SELECT DISTINCT "resources_resource"."id",
"resources_resource"."heading",
"resources_resource"."name",
"resources_resource"."old_name",
"resources_resource"."clean_name",
"resources_resource"."sort_name",
"resources_resource"."see_also_id",
"resources_resource"."referenced_passages",
"resources_resource"."resource_type",
"resources_resource"."ord",
"resources_resource"."content",
"resources_resource"."thumb",
"resources_resource"."resource_origin"
FROM "resources_resource"
INNER JOIN "resources_passageresource" ON ("resources_resource"."id" = "resources_passageresource"."resource_id")
WHERE "resources_passageresource"."start_ref" >= 66001001
ORDER BY "resources_resource"."ord" ASC, "resources_resource"."sort_name" ASC LIMIT 5
By popular request, EXPLAIN ANALYZE: 按流行要求,EXPLAIN ANALYZE:
Limit (cost=1125.50..1125.68 rows=5 width=803) (actual time=4434.076..4434.557 rows=5 loops=1)
-> Unique (cost=1125.50..1136.91 rows=326 width=803) (actual time=4434.076..4434.557 rows=5 loops=1)
-> Sort (cost=1125.50..1126.32 rows=326 width=803) (actual time=4434.075..4434.075 rows=6 loops=1)
Sort Key: resources_resource.ord, resources_resource.sort_name, resources_resource.id, resources_resource.heading, resources_resource.name, resources_resource.old_name, resources_resource.clean_name, resources_resource.see_also_id, resources_resource.referenced_passages, resources_resource.resource_type, resources_resource.content, resources_resource.thumb, resources_resource.resource_origin
Sort Method: quicksort Memory: 424kB
-> Hash Join (cost=697.00..1111.89 rows=326 width=803) (actual time=3.453..41.429 rows=424 loops=1)
Hash Cond: (resources_passageresource.resource_id = resources_resource.id)
-> Bitmap Heap Scan on resources_passageresource (cost=10.78..190.19 rows=326 width=4) (actual time=0.107..0.401 rows=424 loops=1)
Recheck Cond: (start_ref >= 66001001)
-> Bitmap Index Scan on resources_passageresource_start_ref (cost=0.00..10.70 rows=326 width=0) (actual time=0.086..0.086 rows=424 loops=1)
Index Cond: (start_ref >= 66001001)
-> Hash (cost=431.32..431.32 rows=2232 width=803) (actual time=3.228..3.228 rows=2232 loops=1)
Buckets: 1024 Batches: 2 Memory Usage: 947kB
-> Seq Scan on resources_resource (cost=0.00..431.32 rows=2232 width=803) (actual time=0.002..1.621 rows=2232 loops=1)
Total runtime: 4435.460 ms
This is ORM-generated SQL. 这是ORM生成的SQL。 I can work in SQL, but I'm definitely not proficient, and the EXPLAIN output here is mystifying to me.
我可以在SQL中工作,但我绝对不会精通,这里的EXPLAIN输出对我来说很神秘。 What about this query is dragging me down?
这个查询怎么样把我拉下来?
UPDATE: @Ybakos identified that the ORDER_BY
was causing trouble. 更新: @Ybakos发现
ORDER_BY
导致了问题。 Removing the ORDER_BY
clause altogether helps a bit, but the query still takes 800ms. 完全删除
ORDER_BY
子句有点帮助,但查询仍然需要800毫秒。 Here's the EXPLAIN ANALYZE, sans ORDER_BY
: 这是EXPLAIN ANALYZE,没有
ORDER_BY
:
HashAggregate (cost=1122.49..1125.75 rows=326 width=803) (actual time=787.519..787.559 rows=104 loops=1)
-> Hash Join (cost=697.00..1111.89 rows=326 width=803) (actual time=3.381..7.312 rows=424 loops=1)
Hash Cond: (resources_passageresource.resource_id = resources_resource.id)
-> Bitmap Heap Scan on resources_passageresource (cost=10.78..190.19 rows=326 width=4) (actual time=0.095..0.686 rows=424 loops=1)
Recheck Cond: (start_ref >= 66001001)
-> Bitmap Index Scan on resources_passageresource_start_ref (cost=0.00..10.70 rows=326 width=0) (actual time=0.079..0.079 rows=424 loops=1)
Index Cond: (start_ref >= 66001001)
-> Hash (cost=431.32..431.32 rows=2232 width=803) (actual time=3.173..3.173 rows=2232 loops=1)
Buckets: 1024 Batches: 2 Memory Usage: 947kB
-> Seq Scan on resources_resource (cost=0.00..431.32 rows=2232 width=803) (actual time=0.002..1.568 rows=2232 loops=1)
Total runtime: 787.678 ms
It seems to me, DISTINCT
has to be used to remove duplicates produced by the join. 在我看来,
DISTINCT
必须用于删除连接产生的重复项。 So my question is, why produce the duplicates in the first place? 所以我的问题是,为什么要首先制作副本? I'm not entirely sure what this query's being ORM-generated must imply, but if rewriting it is an option, you could certainly rewrite it in such a way as to prevent duplicates from appearing.
我不完全确定这个ORM生成的查询必须暗示什么,但如果重写它是一个选项,你当然可以重写它以防止重复出现。 For instance, using
IN
: 例如,使用
IN
:
SELECT "resources_resource"."id",
"resources_resource"."heading",
"resources_resource"."name",
"resources_resource"."old_name",
"resources_resource"."clean_name",
"resources_resource"."sort_name",
"resources_resource"."see_also_id",
"resources_resource"."referenced_passages",
"resources_resource"."resource_type",
"resources_resource"."ord",
"resources_resource"."content",
"resources_resource"."thumb",
"resources_resource"."resource_origin"
FROM "resources_resource"
WHERE "resources_resource"."id" IN (
SELECT "resources_passageresource"."resource_id"
FROM "resources_passageresource"
WHERE "resources_passageresource"."start_ref" >= 66001001
)
ORDER BY "resources_resource"."ord" ASC, "resources_resource"."sort_name" ASC LIMIT 5
or using EXISTS
: 或使用
EXISTS
:
SELECT "resources_resource"."id",
"resources_resource"."heading",
"resources_resource"."name",
"resources_resource"."old_name",
"resources_resource"."clean_name",
"resources_resource"."sort_name",
"resources_resource"."see_also_id",
"resources_resource"."referenced_passages",
"resources_resource"."resource_type",
"resources_resource"."ord",
"resources_resource"."content",
"resources_resource"."thumb",
"resources_resource"."resource_origin"
FROM "resources_resource"
WHERE EXISTS (
SELECT *
FROM "resources_passageresource"
WHERE "resources_passageresource"."resource_id" = "resources_resource"."id"
AND "resources_passageresource"."start_ref" >= 66001001
)
ORDER BY "resources_resource"."ord" ASC, "resources_resource"."sort_name" ASC LIMIT 5
And, of course, if it's acceptable to rewrite the query completely, I would also remove the long table names in front of column names. 当然,如果完全重写查询是可以接受的,我还会删除列名前面的长表名。 Consider the following, for instance (the
IN
query rewritten): 例如,请考虑以下内容(重写
IN
查询):
SELECT "id",
"heading",
"name",
"old_name",
"clean_name",
"sort_name",
"see_also_id",
"referenced_passages",
"resource_type",
"ord",
"content",
"thumb",
"resource_origin"
FROM "resources_resource"
WHERE "resources_resource"."id" IN (
SELECT "resource_id"
FROM "resources_passageresource"
WHERE "start_ref" >= 66001001
)
ORDER BY "ord" ASC, "sort_name" ASC LIMIT 5
It's the combination of ORDER BY with LIMIT. 它是ORDER BY和LIMIT的组合。
If you don't have an index on (ord, sort_name) then I bet this is the cause of the slow performance. 如果你没有(ord,sort_name)索引,那么我敢打赌这是导致性能下降的原因。 Or perhaps an index on (start_ref, ord, sort_name) is necessary for this particular query.
或者,此特定查询可能需要(start_ref,ord,sort_name)上的索引。 Lastly, due to that join, perhaps have the left/first table be the one upon which your ORDER BY criteria applies.
最后,由于该连接,可能左/第一个表是您的ORDER BY条件适用的表。
That seems like a long time in the JOIN. 这在JOIN中似乎很长一段时间。 The default memory settings in postgresql.conf are too low for any modern computer.
postgresql.conf中的默认内存设置对于任何现代计算机来说都太低了。 Have you remembered to bump them up?
你有没有记得碰它们?
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.