
Slow index scan with about 10 million rows

I have a table with about 10 million entries, which I'm trying to optimize.

create table houses
(
    id                                serial                          not null
        constraint houses_pkey
            primary key,
    secondary_id                      text                            not null,
    market                            integer                         not null,
    user_id                           uuid                            not null,
    status                            text      default ''::text      not null,
    custom                            boolean   default false,
    constraint houses_unique_constraint
        unique (user_id, market, secondary_id)
);

create index houses_user_index
    on houses (user_id);
create index houses_user_market_index
    on houses (user_id, market);
create index houses_user_status_index
    on houses (user_id, status);

I have a use case where I want to find all distinct non-null user_id and market combinations with given statuses, along with whether any of the entries in each combination have their custom flag set. I'm using the following query, but it's very slow. Do you have any ideas what I could optimize here? Thank you!

postgres=# EXPLAIN ANALYZE VERBOSE SELECT DISTINCT user_id, market, bool_or(custom) 
FROM houses WHERE user_id IS NOT NULL 
AND status=ANY('{open, sold}') GROUP BY user_id, market;
                                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1694157.78..1695700.38 rows=154260 width=21) (actual time=9574.290..9704.120 rows=809916 loops=1)
   Output: user_id, market, (bool_or(custom))
   ->  Sort  (cost=1694157.78..1694543.43 rows=154260 width=21) (actual time=9574.289..9625.108 rows=809916 loops=1)
         Output: user_id, market, (bool_or(custom))
         Sort Key: houses.user_id, houses.market, (bool_or(houses.custom))
         Sort Method: external sort  Disk: 24544kB
         ->  GroupAggregate  (cost=0.56..1677700.42 rows=154260 width=21) (actual time=0.396..9290.278 rows=809916 loops=1)
               Output: user_id, market, bool_or(custom)
               Group Key: houses.user_id, houses.market
               ->  Index Scan using houses_user_market_index on public.houses  (cost=0.56..1615726.52 rows=8057507 width=21) (actual time=0.350..8647.480 rows=8114889 loops=1)
                     Output: user_id, market, custom
                     Index Cond: (houses.user_id IS NOT NULL)
                      Filter: (houses.status = ANY ('{open,sold}'::text[]))
                     Rows Removed by Filter: 892609
 Planning time: 0.889 ms
 Execution time: 9729.300 ms
(16 rows)

I have tried adding more indices to cover the custom field as well, but it doesn't seem to make any difference.

No matter what, you are summarizing over 8 million rows. You might be able to improve things, but don't expect any magic.

The first thing to do is drop the DISTINCT, as the GROUP BY already renders that combination of columns distinct (though the planner does not seem to know that). But it looks like that will only save about 0.5 seconds.
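
For example, the same query with the redundant DISTINCT dropped (nothing else changed):

-- GROUP BY user_id, market already makes each (user_id, market) row unique
SELECT user_id, market, bool_or(custom)
FROM houses
WHERE user_id IS NOT NULL
AND status = ANY('{open, sold}')
GROUP BY user_id, market;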

In your existing plan, the index does not provide any usable selectivity. What it does offer is production of the data in an order which suits the GroupAggregate. But it still has to hop all around the table to pull out the additional columns, and I am surprised it finds this an attractive option. Perhaps that is because the table data is highly correlated on user_id, so doing this mostly visits the table pages in physical order.
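
You can check that guess: the pg_stats view exposes the planner's estimate of how well each column's values track the physical row order (a correlation near 1 or -1 means an index scan on that column visits the table pages nearly sequentially):

-- correlation close to +/-1 means the heap is nearly sorted on this column
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'houses' AND attname = 'user_id';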

Even if that is the case, it would be better to do an index-only scan, which you can get by having a covering index on (user_id, market, status, custom). You don't need the INCLUDE feature to have a covering index, so being on v10 is not a problem; you just have to put the columns into the body of the index. It has been recommended to put status earlier in the index, but doing that would wreck the ordering property without providing any meaningful selectivity benefit.
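
A sketch of such a covering index (the index name here is my own invention; CONCURRENTLY avoids blocking writes while it builds, and note that index-only scans also need the table to be well vacuumed so the visibility map is current):

-- every column the query touches is in the index, so the heap can be skipped
CREATE INDEX CONCURRENTLY houses_user_market_status_custom_index
    ON houses (user_id, market, status, custom);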

You might get some benefit from parallel execution by lowering parallel_tuple_cost (though in my case it was actually a harm, not a benefit; maybe due to lousy hardware).
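
If you want to experiment with that, something like the following in a session; 0.1 is the default for parallel_tuple_cost, and the lower value is only an example to make a parallel plan look cheaper to the planner:

-- make transferring tuples from workers look cheaper, then re-check the plan
SET parallel_tuple_cost = 0.01;
EXPLAIN ANALYZE
SELECT user_id, market, bool_or(custom)
FROM houses
WHERE user_id IS NOT NULL
AND status = ANY('{open, sold}')
GROUP BY user_id, market;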
