Why doesn't Postgres use the index with DISTINCT?

Question

I have this table:

CREATE TABLE public.prodhistory (
  curve_id           int4 NOT NULL,
  start_prod_date    date NOT NULL,
  prod_date          date NOT NULL,
  monthly_prod_rate  float4 NOT NULL,
  eff_date           timestamp NOT NULL,
  /* Keys */
  CONSTRAINT prodhistorypk
    PRIMARY KEY (curve_id, prod_date, start_prod_date, eff_date),
  /* Foreign keys */
  CONSTRAINT prodhistory2typecurves_fk
    FOREIGN KEY (curve_id)
    REFERENCES public.typecurves(curve_id)
) WITH (
    OIDS = FALSE
  );

CREATE INDEX prodhistory_idx_curve_id01
  ON public.prodhistory
  (curve_id);

with ~42M rows.

And I execute this query:

SELECT DISTINCT curve_id FROM prodhistory

I expected this to be very quick, given the index. But no, 270 secs. So I ran EXPLAIN ANALYZE on it, and I get:

HashAggregate  (cost=824870.03..824873.08 rows=305 width=4) (actual time=211834.018..211834.097 rows=315 loops=1)   
  Output: curve_id  
  Group Key: prodhistory.curve_id   
  ->  Seq Scan on public.prodhistory  (cost=0.00..718003.22 rows=42746722 width=4) (actual time=12.751..200826.299 rows=43218808 loops=1)   
        Output: curve_id    
Planning time: 0.115 ms 
Execution time: 211848.137 ms

I'm not too experienced in reading these plans, but a Seq Scan over the whole table seems bad.

Any thoughts? I'm sort of stumped.

Answer 1

This plan is chosen because PostgreSQL estimates it to be cheaper.

You can compare by setting

SET enable_seqscan=off;

and then re-running your EXPLAIN (ANALYZE) statement. Compare the estimated cost and the actual time in both cases, and check whether PostgreSQL estimated correctly or not.
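The comparison described above could look roughly like this in a single psql session. This is only a sketch: the BUFFERS option is my addition (it shows how many pages each plan touched), and keep in mind that EXPLAIN (ANALYZE) really executes the query, so each run may take minutes on a table this size.

```sql
-- Session-local experiment; nothing here persists after the session ends.

-- Discourage sequential scans. This only adds a large cost penalty,
-- so the planner can still pick a Seq Scan if it has no alternative.
SET enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT curve_id FROM prodhistory;

-- Restore the default and run the same statement again for comparison.
RESET enable_seqscan;

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT curve_id FROM prodhistory;
```

If the forced index plan shows a higher estimated cost but a lower actual time, the planner's cost model is probably off for your hardware.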

If you find that using an Index Scan or Index Only Scan is actually faster, you could consider adjusting the cost parameters to match your machine better, e.g. lower random_page_cost or cpu_index_tuple_cost, or raise cpu_tuple_cost.
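As a rough sketch of that kind of tuning, the parameters can be tried per session before changing anything in postgresql.conf. The values below are purely illustrative assumptions, not recommendations; the stock defaults are noted in the comments.

```sql
-- Illustrative values only; pick them based on your own measurements.
SET random_page_cost = 1.5;        -- default 4.0
SET cpu_index_tuple_cost = 0.001;  -- default 0.005
-- Alternatively, raise cpu_tuple_cost (default 0.01).

-- Check whether the planner now chooses the index on its own.
EXPLAIN (ANALYZE)
SELECT DISTINCT curve_id FROM prodhistory;

-- Undo the session-level changes.
RESET random_page_cost;
RESET cpu_index_tuple_cost;
```

Since these settings affect every query's plan, it is probably worth checking a few other representative queries before making a change permanent.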

3 2016-07-06 08:39:36
