SQL - Slow query when fetching records from large table (100M+) using join, tips?
Working on improving the query below on some large tables (I'm using Postgres v12.4):
people : 137 Million records
delimiters: 1.2 Million records
person_delimiters: 329 Million records
SELECT "delimiters".*, "a"."person_id" FROM "delimiters"
INNER JOIN "person_delimiters" as "a" on "delimiters"."id" = "a"."delimiter_id"
WHERE ("a"."person_id" IN (SELECT id FROM "people" LIMIT 1000));
(the subquery using LIMIT 1000 is here only to keep the example generic; in the real application I pass in specific sets of 1000 person ids)
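To make that concrete, in production the subquery gives way to an explicit id set; a hypothetical equivalent (the ids below are invented for illustration):

```sql
-- Hypothetical: the application supplies ~1000 concrete ids instead of the subquery
SELECT delimiters.*, a.person_id
FROM delimiters
INNER JOIN person_delimiters AS a ON delimiters.id = a.delimiter_id
WHERE a.person_id IN (101, 102, 103);  -- invented ids, real sets hold ~1000
```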
person_delimiters is an intermediary table with two columns (person_id, delimiter_id).
Output from EXPLAIN ANALYZE:
Hash Join  (cost=46025.12..11207871.13 rows=164750752 width=2346) (actual time=65659.597..66354.997 rows=1044 loops=1)
  Hash Cond: (a.delimiter_id = delimiters.id)
  ->  Hash Semi Join  (cost=57.96..8456389.56 rows=164750752 width=32) (actual time=20.582..64777.963 rows=1044 loops=1)
        Hash Cond: (a.person_id = people.id)
        ->  Seq Scan on person_delimiters a  (cost=0.00..5758538.04 rows=329501504 width=32) (actual time=0.008..30865.854 rows=329501518 loops=1)
        ->  Hash  (cost=45.46..45.46 rows=1000 width=16) (actual time=0.384..0.385 rows=1000 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 55kB
              ->  Limit  (cost=0.00..35.46 rows=1000 width=16) (actual time=0.004..0.241 rows=1000 loops=1)
                    ->  Seq Scan on people  (cost=0.00..4888821.40 rows=137873840 width=16) (actual time=0.003..0.158 rows=1000 loops=1)
  ->  Hash  (cost=24508.85..24508.85 rows=71385 width=2330) (actual time=839.841..839.841 rows=1227076 loops=1)
        Buckets: 2048  Batches: 64  Memory Usage: 3015kB
        ->  Seq Scan on delimiters  (cost=0.00..24508.85 rows=71385 width=2330) (actual time=0.007..303.814 rows=1227076 loops=1)
Planning Time: 1.197 ms
Execution Time: 66355.110 ms
Indexes:
-- person_delimiters --
public | person_delimiters | person_delimiters_pkey | | CREATE UNIQUE INDEX person_delimiters_pkey ON public.person_delimiters USING btree (person_id, delimiter_id)
public | person_delimiters | idx_person_delimiters_delimiter_id_person_id | | CREATE INDEX idx_person_delimiters_delimiter_id_person_id ON public.person_delimiters USING btree (delimiter_id, person_id)
-- people --
public | people | people_pkey | | CREATE UNIQUE INDEX people_pkey ON public.people USING btree (id)
-- delimiters --
public | delimiters | delimiters_pkey | | CREATE UNIQUE INDEX delimiters_pkey ON public.delimiters USING btree (id)
Anything I could work on to optimize it?
It normally works best to start from the smallest possible table/dataset and work up from there. The following is probably the most efficient way of approaching the problem in SQL (the second CTE is really there to clarify the approach; you could probably move its logic into the main SELECT statement without affecting much).
If this doesn't improve performance significantly, then you are probably looking at indexing, partitioning, or temporary-table solutions.
WITH person_id_list AS
(
    SELECT id FROM people LIMIT 1000
),
person_delim_list AS
(
    SELECT pd.delimiter_id, pd.person_id
    FROM person_id_list pil
    INNER JOIN person_delimiters pd ON pil.id = pd.person_id
)
SELECT d.*, pdl.person_id
FROM person_delim_list pdl
INNER JOIN delimiters d ON pdl.delimiter_id = d.id;
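The temporary-table route mentioned above can be sketched as follows (illustrative only, assuming the schema shown in the question). Running ANALYZE matters here: the plan above estimates rows=71385 for delimiters against 1227076 actual, which suggests stale statistics, and fresh statistics on the small id table help the planner choose index lookups instead of full scans.

```sql
-- Sketch of the temporary-table approach (not tested against the real data)
CREATE TEMPORARY TABLE person_id_list (id bigint PRIMARY KEY);

-- In the real application, insert the specific set of ~1000 person ids here
INSERT INTO person_id_list SELECT id FROM people LIMIT 1000;

ANALYZE person_id_list;  -- give the planner accurate row counts for the small set
ANALYZE delimiters;      -- the plan above shows a large misestimate on this table

SELECT d.*, pd.person_id
FROM person_id_list pil
INNER JOIN person_delimiters pd ON pd.person_id = pil.id
INNER JOIN delimiters d ON d.id = pd.delimiter_id;
```

With accurate statistics, the planner can drive nested-loop index scans via person_delimiters_pkey (person_id, delimiter_id) and delimiters_pkey, touching only the ~1000 ids instead of sequentially scanning the 329M-row join table.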