PostgreSQL does a seq scan when the index applies. Why?

I have a query with a join on a varchar(24) primary key. The reasons for this being a key are legacy, and it is targeted for change. However, the PostgreSQL query planner insists on doing a sequential scan, which seems unreasonable to me. I back up my claim of "unreasonable" with the fact that "SET enable_seqscan = off" speeds up this query by a factor of 8.

I've run "vacuum analyze" and I've played with the statistics settings, but have had no luck so far.

The query is:

select inventry.id, inventry.count, sum(invenwh.count) 
from invenwh join inventry on inventry.id=invenwh.id
where inventry.product_c='CAT17' 
group by 1, 2;

The following sets up the database for running this query.

drop table if exists inventry;
drop table if exists inwh;
drop table if exists invenwh;
drop table if exists inprodcategory;

-- Create 50 product categories.
create table inprodcategory as 
select i as id, concat('CAT', lpad(i::text, 2, '0'))::varchar(10) as category
from generate_series(1, 50, 1) as i;

-- Create inventory items: 245,000 generated, 240,100 kept (i % 50 = 0 matches no category)
create table inventry as 
select 
    concat('ITEM', lpad(i::text, 6, '0'))::varchar(24) as id, 
    concat('Item #', i::text)::varchar(50) as descr_1,
    c.category as product_c,
    (case when random() < 0.05 then (random()*70)::int else 0::int end) as count
from generate_series(1, 245000, 1) as i
    join inprodcategory as c on c.id=(i%50)::int;

-- Create 70 warehouses
create table inwh as 
select concat('WAREHOUSE', lpad(i::text, 2, '0'))::varchar(10) as warehouse
from generate_series(1, 70, 1) as i;

-- Create (ugly) cross-join table with counts/warehouse
create table invenwh as 
select id, warehouse, 
    (case when random() < 0.05 then (random()*10)::int else 0::int end) as count
from inventry, inwh;

create index on invenwh (id);
create index on inventry (id);

After running the above, you can run the query. On my hardware with an SSD, an i7 and 16 GB of RAM, it takes 4 seconds, but if I run "set enable_seqscan=off", it takes about 500 ms.
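One way to reproduce the comparison is to capture both plans in the same session. This is only a sketch, assuming the tables from the setup script above already exist. enable_seqscan is a planner setting that discourages sequential scans rather than forbidding them, and reset restores the default afterwards.

```sql
-- Plan and timing with the default planner settings.
explain (analyze, buffers)
select inventry.id, inventry.count, sum(invenwh.count)
from invenwh join inventry on inventry.id=invenwh.id
where inventry.product_c='CAT17'
group by 1, 2;

-- Discourage sequential scans for this session only (a diagnostic, not a fix).
set enable_seqscan = off;

explain (analyze, buffers)
select inventry.id, inventry.count, sum(invenwh.count)
from invenwh join inventry on inventry.id=invenwh.id
where inventry.product_c='CAT17'
group by 1, 2;

-- Restore the default so later queries are planned normally.
reset enable_seqscan;
```

Running each statement twice helps separate the effect of the plan from the effect of a cold cache.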

Edit: added explain (analyze, buffers) output

HashAggregate  (cost=449773.25..449822.25 rows=4900 width=19) (actual time=4180.006..4181.092 rows=4900 loops=1)
  Group Key: inventry.id, inventry.count
  Buffers: shared hit=4526 read=121051
  ->  Hash Join  (cost=5058.50..447200.75 rows=343000 width=19) (actual time=1285.800..4086.398 rows=343000 loops=1)
        Hash Cond: ((invenwh.id)::text = (inventry.id)::text)
        Buffers: shared hit=4526 read=121051
        ->  Seq Scan on invenwh  (cost=0.00..291651.00 rows=16807000 width=15) (actual time=0.077..1949.843 rows=16807000 loops=1)
              Buffers: shared hit=2530 read=121051
        ->  Hash  (cost=4997.25..4997.25 rows=4900 width=15) (actual time=48.897..48.897 rows=4900 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 230kB
              Buffers: shared hit=1996
              ->  Seq Scan on inventry  (cost=0.00..4997.25 rows=4900 width=15) (actual time=21.903..47.031 rows=4900 loops=1)
                    Filter: ((product_c)::text = 'CAT17'::text)
                    Rows Removed by Filter: 235200
                    Buffers: shared hit=1996
Planning time: 4.266 ms
Execution time: 4181.395 ms
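
The estimated cost of the sequential scan on invenwh can be reproduced by hand, which shows which cost constants drive the plan. The sketch below assumes the default seq_page_cost = 1.0 and cpu_tuple_cost = 0.01, and infers a table size of 123581 pages from the buffer counts in the plan (2530 hits + 121051 reads).

```sql
-- Seq scan cost = pages * seq_page_cost + rows * cpu_tuple_cost
select 123581 * 1.0        -- pages read sequentially (inferred from the Buffers line)
     + 16807000 * 0.01     -- per-row CPU cost for every row in invenwh
       as seq_scan_cost;   -- matches the upper cost of 291651.00 in the plan
```

random_page_cost does not appear in this formula. It only enters the cost of the competing index scan, which is why changing it can shift the planner's choice between the two plans.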

Edit: Specific follow-up questions

Thanks to @a_horse_with_no_name (big thank you!!), it seems like lowering random_page_cost is the thing to do. This seems more or less in agreement with https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Q: Is there any benchmark I can run to discover optimal values for random_page_cost? In production, I'm on a SCSI disk (LSI MR9260-8i).

Q: I feel like statistics may also be relevant here, but I'm coming up empty looking for a pg-stats-for-dummies type page on the internet. Any hints on learning about stats?
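
As a starting point for the statistics question, what the planner has collected is visible in the pg_stats view. This is a sketch, assuming the tables from the setup script are the only ones with these names; the statistics target of 500 is an illustrative value, not a recommendation.

```sql
-- What the planner knows about the filter and join columns after ANALYZE.
select tablename, attname, n_distinct, null_frac, correlation, most_common_vals
from pg_stats
where tablename in ('inventry', 'invenwh')
  and attname in ('id', 'product_c');

-- Collect more detailed statistics for one column, then refresh them.
alter table inventry alter column product_c set statistics 500;
analyze inventry;
```

In the plan above the estimated and actual row counts already agree (4900 and 343000), which suggests that the row estimates are not what is misleading the planner here.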

When the costs estimated by the planner don't match the reality of the execution time, cost settings should be adjusted to better match your hardware.

The various knobs are documented at Planner Cost Constants.

In particular, there is this advice on random_page_cost that's relevant to your case:

Storage that has a low random read cost relative to sequential, e.g. solid-state drives, might also be better modeled with a lower value for random_page_cost.

See also Random Page Cost Revisited for more tuning advice on this parameter with 5 different storage types.

TL;DR: for an SSD, first try 1.5 for random_page_cost.
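
A sketch of how the value could be tried out before committing to it. It assumes the tables from the question exist and, for the persistent variant, superuser rights on PostgreSQL 9.4 or later. 1.5 is a starting point to test, not a measured optimum.

```sql
-- Try the value in the current session and check whether the plan changes.
set random_page_cost = 1.5;

explain (analyze, buffers)
select inventry.id, inventry.count, sum(invenwh.count)
from invenwh join inventry on inventry.id=invenwh.id
where inventry.product_c='CAT17'
group by 1, 2;

-- If the new plan is consistently faster, persist the setting server-wide.
alter system set random_page_cost = 1.5;
select pg_reload_conf();
```

Setting it per session first keeps the experiment from affecting other queries until the effect has been measured.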
