
PostgreSQL query is not using an index

Environment

My PostgreSQL (9.2) schema looks like this:

CREATE TABLE first
(
   id_first bigint NOT NULL,
   first_date timestamp without time zone NOT NULL,
   CONSTRAINT first_pkey PRIMARY KEY (id_first)
)
WITH (
   OIDS=FALSE
);

CREATE INDEX first_first_date_idx
   ON first
   USING btree
     (first_date);

CREATE TABLE second
(
   id_second bigint NOT NULL,
   id_first bigint NOT NULL,
   CONSTRAINT second_pkey PRIMARY KEY (id_second),
   CONSTRAINT fk_first FOREIGN KEY (id_first)
      REFERENCES first (id_first) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
   OIDS=FALSE
);

CREATE INDEX second_id_first_idx
   ON second
   USING btree
   (id_first);

CREATE TABLE third
(
   id_third bigint NOT NULL,
   id_second bigint NOT NULL,
   CONSTRAINT third_pkey PRIMARY KEY (id_third),
   CONSTRAINT fk_second FOREIGN KEY (id_second)
      REFERENCES second (id_second) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
   OIDS=FALSE
);

CREATE INDEX third_id_second_idx
   ON third
   USING btree
   (id_second);

So, I have three tables, each with its own PK. first has an index on first_date, second has a FK referencing first with an index on it, and third has a FK referencing second with an index on it as well:

 First (0 --> n) Second (0 --> n) Third

The first table contains about 10 000 000 records, the second about 20 000 000 records, and the third about 18 000 000 records.

The date range in column first_date is from 2016-01-01 until today.

random_page_cost is set to 2.0. default_statistics_target is set to 100. STATISTICS on all FK, PK and first_date columns are set to 5000.
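The per-column statistics targets mentioned above were presumably set with statements along these lines (a sketch; the exact column list is an assumption based on the schema shown):

```sql
-- Per-column statistics targets override default_statistics_target (100)
ALTER TABLE first  ALTER COLUMN first_date SET STATISTICS 5000;
ALTER TABLE first  ALTER COLUMN id_first   SET STATISTICS 5000;
ALTER TABLE second ALTER COLUMN id_first   SET STATISTICS 5000;
ALTER TABLE second ALTER COLUMN id_second  SET STATISTICS 5000;
ALTER TABLE third  ALTER COLUMN id_second  SET STATISTICS 5000;

-- The new targets only take effect after statistics are re-collected
ANALYZE first;
ANALYZE second;
ANALYZE third;
```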

Task to do

I want to count all third rows connected with first, where first_date < X.

My query:

SELECT count(t.id_third) AS count
FROM first f
JOIN second s ON s.id_first = f.id_first 
JOIN third t ON t.id_second = s.id_second
WHERE first_date < _my_date

Problem description

  • Asking for 2 days - _my_date = '2016-01-03'

Everything works pretty well; the query takes 1-2 seconds. EXPLAIN ANALYZE:

"Aggregate  (cost=8585512.55..8585512.56 rows=1 width=8) (actual time=67.310..67.310 rows=1 loops=1)"
"  ->  Merge Join  (cost=4208477.49..8583088.04 rows=969805 width=8) (actual time=44.277..65.948 rows=17631 loops=1)"
"        Merge Cond: (s.id_second = t.id_second)"
"        ->  Sort  (cost=4208477.48..4211121.75 rows=1057709 width=8) (actual time=44.263..46.035 rows=19230 loops=1)"
"              Sort Key: s.id_second"
"              Sort Method: quicksort  Memory: 1670kB"
"              ->  Nested Loop  (cost=0.01..4092310.41 rows=1057709 width=8) (actual time=6.169..39.183 rows=19230 loops=1)"
"                    ->  Index Scan using first_first_date_idx on first f  (cost=0.01..483786.81 rows=492376 width=8)  (actual time=6.159..12.223 rows=10346 loops=1)"
"                          Index Cond: (first_date < '2016-01-03 00:00:00'::timestamp without time zone)"
"                    ->  Index Scan using second_id_first_idx on second s  (cost=0.00..7.26 rows=7 width=16) (actual time=0.002..0.002 rows=2 loops=10346)"
"                          Index Cond: (id_first = f.id_first)"
"        ->  Index Scan using third_id_second_idx on third t  (cost=0.00..4316649.89 rows=17193788 width=16) (actual time=0.008..7.293 rows=17632 loops=1)"
"Total runtime: 67.369 ms"
  • Asking for 10 days or more - _my_date = '2016-01-11' or later

The query does not use an index scan anymore - it is replaced by a seq scan and takes 3-4 minutes... Query plan:

"Aggregate  (cost=8731468.75..8731468.76 rows=1 width=8) (actual time=234411.229..234411.229 rows=1 loops=1)"
"  ->  Hash Join  (cost=4352424.81..8728697.88 rows=1108348 width=8) (actual time=189670.068..234400.540 rows=138246 loops=1)"
"        Hash Cond: (t.id_second = o.id_second)"
"        ->  Seq Scan on third t  (cost=0.00..4128080.88 rows=17193788 width=16) (actual time=0.016..124111.453 rows=17570724 loops=1)"
"        ->  Hash  (cost=4332592.69..4332592.69 rows=1208810 width=8) (actual time=98566.740..98566.740 rows=151263 loops=1)"
"              Buckets: 16384  Batches: 16  Memory Usage: 378kB"
"              ->  Hash Join  (cost=561918.25..4332592.69 rows=1208810 width=8) (actual time=6535.801..98535.915 rows=151263 loops=1)"
"                    Hash Cond: (s.id_first = f.id_first)"
"                    ->  Seq Scan on second s  (cost=0.00..3432617.48 rows=18752248 width=16) (actual time=6090.771..88891.691 rows=19132869 loops=1)"
"                    ->  Hash  (cost=552685.31..552685.31 rows=562715 width=8) (actual time=444.630..444.630 rows=81650 loops=1)"
"                          ->  Index Scan using first_first_date_idx on first f  (cost=0.01..552685.31 rows=562715 width=8) (actual time=7.987..421.087 rows=81650 loops=1)"
"                                Index Cond: (first_date < '2016-01-13 00:00:00'::timestamp without time zone)"
"Total runtime: 234411.303 ms"

For test purposes, I have set:

 SET enable_seqscan = OFF;

My queries start using an index scan again and take 1-10 s (depending on the range).
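Note that enable_seqscan = off is a debugging aid rather than a production setting. If you only want to apply it to a single query, a sketch like this limits its scope to one transaction (SET LOCAL reverts automatically at COMMIT or ROLLBACK):

```sql
BEGIN;
-- Only affects the current transaction
SET LOCAL enable_seqscan = off;

SELECT count(t.id_third) AS count
FROM first f
JOIN second s ON s.id_first = f.id_first
JOIN third t ON t.id_second = s.id_second
WHERE first_date < '2016-01-11';

COMMIT;
```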

Question

Why does it work like that? How can I convince the query planner to use an index scan?

EDIT

After reducing random_page_cost to 1.1, I can now select about 30 days and still get an index scan. The query plan changed a little:

"Aggregate  (cost=8071389.47..8071389.48 rows=1 width=8) (actual  time=4915.196..4915.196 rows=1 loops=1)"
"  ->  Nested Loop  (cost=0.01..8067832.28 rows=1422878 width=8) (actual time=14.402..4866.937 rows=399184 loops=1)"
"        ->  Nested Loop  (cost=0.01..3492321.55 rows=1551849 width=8) (actual time=14.393..3012.617 rows=436794 loops=1)"
"              ->  Index Scan using first_first_date_idx on first f  (cost=0.01..432541.99 rows=722404 width=8) (actual time=14.372..729.233 rows=236007 loops=1)"
"                    Index Cond: (first_date < '2016-02-01 00:00:00'::timestamp without time zone)"
"              ->  Index Scan using second_id_first_idx on second s  (cost=0.00..4.17 rows=7 width=16) (actual time=0.008..0.009 rows=2 loops=236007)"
"                    Index Cond: (id_first = f.id_first)"
"        ->  Index Scan using third_id_second_idx on third t  (cost=0.00..2.94 rows=1 width=16) (actual time=0.004..0.004 rows=1 loops=436794)"
"              Index Cond: (id_second = s.id_second)"
"Total runtime: 4915.254 ms"

However, I still don't get why asking for a larger range causes a seq scan...

Interestingly, when I ask for a range just above some kind of limit, I get a query plan like this (here, a select for 40 days - asking for more will produce a full seq scan again):

"Aggregate  (cost=8403399.27..8403399.28 rows=1 width=8) (actual time=138303.216..138303.217 rows=1 loops=1)"
"  ->  Hash Join  (cost=3887619.07..8399467.63 rows=1572656 width=8) (actual time=44056.443..138261.203 rows=512062 loops=1)"
"        Hash Cond: (t.id_second = s.id_second)"
"        ->  Seq Scan on third t  (cost=0.00..4128080.88 rows=17193788 width=16) (actual time=0.004..119497.056 rows=17570724 loops=1)"
"        ->  Hash  (cost=3859478.04..3859478.04 rows=1715203 width=8) (actual time=5695.077..5695.077 rows=560503 loops=1)"
"              Buckets: 16384  Batches: 16  Memory Usage: 1390kB"
"              ->  Nested Loop  (cost=0.01..3859478.04 rows=1715203 width=8) (actual time=65.250..5533.413 rows=560503 loops=1)"
"                    ->  Index Scan using first_first_date_idx on first f  (cost=0.01..477985.28 rows=798447 width=8) (actual time=64.927..1688.341 rows=302663 loops=1)"
"                          Index Cond: (first_date < '2016-02-11 00:00:00'::timestamp without time zone)"
"                    ->  Index Scan using second_id_first_idx on second s (cost=0.00..4.17 rows=7 width=16) (actual time=0.010..0.012 rows=2 loops=302663)"
"                          Index Cond: (id_first = f.id_first)"
"Total runtime: 138303.306 ms"

UPDATE after Laurenz Albe's suggestions

After rewriting the query as Laurenz Albe suggested:

"Aggregate  (cost=9102321.05..9102321.06 rows=1 width=8) (actual time=15237.830..15237.830 rows=1 loops=1)"
"  ->  Merge Join  (cost=4578171.25..9097528.19 rows=1917143 width=8) (actual time=9111.694..15156.092 rows=803657 loops=1)"
"        Merge Cond: (third.id_second = s.id_second)"
"        ->  Index Scan using third_id_second_idx on third  (cost=0.00..4270478.19 rows=17193788 width=16) (actual time=23.650..5425.137 rows=803658 loops=1)"
"        ->  Materialize  (cost=4577722.81..4588177.38 rows=2090914 width=8) (actual time=9088.030..9354.326 rows=879283 loops=1)"
"              ->  Sort  (cost=4577722.81..4582950.09 rows=2090914 width=8) (actual time=9088.023..9238.426 rows=879283 loops=1)"
"                    Sort Key: s.id_second"
"                    Sort Method: external sort  Disk: 15480kB"
"                    ->  Merge Join  (cost=673389.38..4341477.37 rows=2090914 width=8) (actual time=3662.239..8485.768 rows=879283 loops=1)"
"                          Merge Cond: (s.id_first = f.id_first)"
"                          ->  Index Scan using second_id_first_idx on second s  (cost=0.00..3587838.88 rows=18752248 width=16) (actual time=0.015..4204.308 rows=879284 loops=1)"
"                          ->  Materialize  (cost=672960.82..677827.55 rows=973345 width=8) (actual time=3662.216..3855.667 rows=892988 loops=1)"
"                                ->  Sort  (cost=672960.82..675394.19 rows=973345 width=8) (actual time=3662.213..3745.975 rows=476519 loops=1)"
"                                      Sort Key: f.id_first"
"                                      Sort Method: external sort  Disk: 8400kB"
"                                      ->  Index Scan using first_first_date_idx on first f (cost=0.01..568352.90 rows=973345 width=8) (actual time=126.386..3233.134 rows=476519 loops=1)"
"                                            Index Cond: (first_date < '2016-03-03 00:00:00'::timestamp without time zone)"
"Total runtime: 15244.404 ms"

First, it looks like some of the estimates are off.
Try to ANALYZE the tables and see if that changes the query plan chosen.

What might also help is to lower random_page_cost to a value just over 1 and see if that improves the plan.
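Both suggestions can be tried at the session level, without touching the server configuration (a sketch; the date value is one of the ranges from the question):

```sql
-- Refresh the planner's statistics for all three tables
ANALYZE first;
ANALYZE second;
ANALYZE third;

-- A value just over 1 is appropriate when the data mostly fits
-- in cache or sits on fast storage
SET random_page_cost = 1.1;

-- Re-check which plan the planner now chooses
EXPLAIN ANALYZE
SELECT count(t.id_third) AS count
FROM first f
JOIN second s ON s.id_first = f.id_first
JOIN third t ON t.id_second = s.id_second
WHERE first_date < '2016-01-11';
```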

It is interesting to note that the index scan on third_id_second_idx in the fast query produces only 17632 rows instead of over 17 million, which I can only explain by assuming that from that row on, the values of id_second no longer match any row in the join of first and second, i.e. the merge join is complete after that point.

You can try to exploit that with a rewritten query. Try

JOIN (SELECT id_second, id_third FROM third ORDER BY id_second) t

instead of

JOIN third t

That may result in a better plan, since PostgreSQL won't optimize the ORDER BY away, and the planner may decide that since it has to sort third anyway, it may be cheaper to use a merge join. That way you trick the planner into choosing a plan that it wouldn't recognize as ideal. With a different value distribution the planner's original choice would probably be better.
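Plugged into the original query, the rewrite would look like this (a sketch, keeping the _my_date placeholder from the question):

```sql
SELECT count(t.id_third) AS count
FROM first f
JOIN second s ON s.id_first = f.id_first
-- The ORDER BY in the subquery is not optimized away, which nudges
-- the planner towards a merge join on the pre-sorted output
JOIN (SELECT id_second, id_third
      FROM third
      ORDER BY id_second) t ON t.id_second = s.id_second
WHERE first_date < _my_date;
```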
