Why does this simple query not use the index in postgres?

In my PostgreSQL database I have a table named "product". In this table I have a column named "date_touched" with type timestamp. I created a simple btree index on this column. This is the schema of my table (I omitted irrelevant column and index definitions):

                                           Table "public.product"
          Column           |           Type           | Modifiers                              
---------------------------+--------------------------+-------------------
 id                        | integer                  | not null default nextval('product_id_seq'::regclass)
 date_touched              | timestamp with time zone | not null

Indexes:
    "product_pkey" PRIMARY KEY, btree (id)
    "product_date_touched_59b16cfb121e9f06_uniq" btree (date_touched)

The table has ~300,000 rows and I want to get the n-th element from the table ordered by "date_touched". When I want to get the 1000th element it takes 0.2s, but when I want to get the 100,000th element it takes about 6s. My question is: why does it take so much time to retrieve the 100,000th element, even though I have defined a btree index?

Here is my query with explain analyze, which shows that PostgreSQL does not use the btree index and instead sorts all rows to find the 100,000th element:

  • first query (1000th element):
explain analyze
  SELECT product.id
  FROM product
  ORDER BY product.date_touched ASC
  LIMIT 1
  OFFSET 1000;
                                QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Limit  (cost=3035.26..3038.29 rows=1 width=12) (actual time=160.208..160.209 rows=1 loops=1)
   ->  Index Scan using product_date_touched_59b16cfb121e9f06_uniq on product  (cost=0.42..1000880.59 rows=329797 width=12) (actual time=16.651..159.766 rows=1001 loops=1)
 Total runtime: 160.395 ms
  • second query (100,000th element):
explain analyze
  SELECT product.id
  FROM product
  ORDER BY product.date_touched ASC
  LIMIT 1
  OFFSET 100000;
                           QUERY PLAN                         
------------------------------------------------------------------------------------------------------
 Limit  (cost=106392.87..106392.88 rows=1 width=12) (actual time=6621.947..6621.950 rows=1 loops=1)
   ->  Sort  (cost=106142.87..106967.37 rows=329797 width=12) (actual time=6381.174..6568.802 rows=100001 loops=1)
         Sort Key: date_touched
         Sort Method: external merge  Disk: 8376kB
         ->  Seq Scan on product  (cost=0.00..64637.97 rows=329797 width=12) (actual time=1.357..4184.115 rows=329613 loops=1)
 Total runtime: 6629.903 ms

It is a very good thing that a SeqScan is used here. Your OFFSET 100000 is not a good thing for an IndexScan.

A bit of theory

Btree indexes contain 2 structures inside:

  1. a balanced tree, and
  2. a doubly-linked list of keys.

The first structure allows fast key lookups; the second is responsible for the ordering. For bigger tables, the linked list cannot fit into a single page, so it becomes a list of linked pages, where each page's entries maintain the ordering specified during index creation.

It is wrong to think, though, that such pages sit together on the disk. In fact, it is more likely that they are spread across different locations. So in order to read pages in the index's order, the system has to perform random disk reads. Random disk IO is expensive compared to sequential access. Therefore a good optimizer will prefer a SeqScan instead.
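If you want a feel for how many pages are involved, the system catalog can show the size of the table and of its index (a sketch only; the names are taken from the question, and relpages is an estimate refreshed by VACUUM/ANALYZE):

-- Number of 8 kB pages and human-readable on-disk size
SELECT relname,
       relpages,
       pg_size_pretty(pg_relation_size(oid)) AS on_disk_size
  FROM pg_class
 WHERE relname IN ('product', 'product_date_touched_59b16cfb121e9f06_uniq');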

I highly recommend the book “SQL Performance Explained” to better understand indexes. It is also available on-line.

What is going on?

Your OFFSET clause would cause the database to read the index's linked list of keys (causing lots of random disk reads) and then discard all those results until it reaches the wanted offset. So it is in fact good that Postgres decided to use SeqScan + Sort here; this should be faster.

You can check this assumption with the following steps (sketched right after the list):

  • running EXPLAIN (analyze, buffers) of your big-OFFSET query
  • then doing SET enable_seqscan TO 'off';
  • and running EXPLAIN (analyze, buffers) again, comparing the results.
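A minimal psql sketch of that check, assuming the big-OFFSET query from the question (remember to reset enable_seqscan afterwards so the setting only affects this experiment):

EXPLAIN (analyze, buffers)
  SELECT product.id
  FROM product
  ORDER BY product.date_touched ASC
  LIMIT 1
  OFFSET 100000;

SET enable_seqscan TO 'off';   -- session-local: discourages sequential scans

EXPLAIN (analyze, buffers)
  SELECT product.id
  FROM product
  ORDER BY product.date_touched ASC
  LIMIT 1
  OFFSET 100000;

RESET enable_seqscan;          -- restore the default planner behaviour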

In general, it is better to avoid OFFSET, as DBMSes do not always pick the right approach here. (BTW, which version of PostgreSQL are you using?) Here's a comparison of how it performs for different offset values.


EDIT: In order to avoid OFFSET, you have to base pagination on real data that exists in the table and is part of the index. For this particular case, the following might be possible:

  • show the first N (say, 20) elements
  • include the maximal date_touched shown on the page in all the “Next” links. You can compute this value on the application side. Do similarly for the “Previous” links, except include the minimal date_touched for those (a sketch for this direction follows below).
  • on the server side you will receive the limiting value. Therefore, say for the “Next” case, you can do a query like this:
SELECT id
  FROM product
 WHERE date_touched > $max_date_seen_on_the_page
 ORDER BY date_touched ASC
 LIMIT 20;

This query makes the best use of the index.
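For the “Previous” direction, a symmetric sketch (here $min_date_seen_on_the_page is a hypothetical application-side parameter holding the smallest date_touched shown on the current page; the rows come back newest-first, so reverse them in the application before rendering):

SELECT id
  FROM product
 WHERE date_touched < $min_date_seen_on_the_page
 ORDER BY date_touched DESC
 LIMIT 20;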

Of course, you can adjust this example to your needs. I used pagination as it is a typical case for OFFSET.

One more note: querying 1 row many times, increasing the offset by 1 for each query, will be much more time consuming than doing a single batch query that returns all those records, which are then iterated over on the application side.
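As a sketch of that difference (the 500-row window is an arbitrary example), instead of issuing 500 queries with LIMIT 1 and an offset incremented by 1 each time, you would fetch the whole window in one round trip and walk it in the application; the OFFSET cost is then paid once instead of on every iteration:

-- one round trip instead of 500
SELECT id, date_touched
  FROM product
 ORDER BY date_touched ASC
 LIMIT 500
OFFSET 100000;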
