SQL - Slow query when fetching records from large table (100M+) using join, tips?
Working on improving the query below on some large tables (I'm using Postgres v12.4):
people : 137 Million records
delimiters: 1.2 Million records
person_delimiters: 329 Million records
SELECT "delimiters".*, "a"."person_id" FROM "delimiters"
INNER JOIN "person_delimiters" as "a" on "delimiters"."id" = "a"."delimiter_id"
WHERE ("a"."person_id" IN (SELECT id FROM "people" LIMIT 1000));
(the subquery using LIMIT 1000 is here only to keep the example generic; in the real application I pass in specific sets of 1000 person ids)
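To make that concrete, in production the subquery gives way to an explicit id set; a hypothetical equivalent (the ids below are invented for illustration):

```sql
-- Hypothetical: the application supplies ~1000 concrete ids instead of the subquery
SELECT delimiters.*, a.person_id
FROM delimiters
INNER JOIN person_delimiters AS a ON delimiters.id = a.delimiter_id
WHERE a.person_id IN (101, 102, 103);  -- invented ids, real sets hold ~1000
```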
person_delimiters is an intermediary table with two columns (person_id, delimiter_id).
Output from EXPLAIN ANALYZE:
Hash Join  (cost=46025.12..11207871.13 rows=164750752 width=2346) (actual time=65659.597..66354.997 rows=1044 loops=1)
  Hash Cond: (a.delimiter_id = delimiters.id)
  ->  Hash Semi Join  (cost=57.96..8456389.56 rows=164750752 width=32) (actual time=20.582..64777.963 rows=1044 loops=1)
        Hash Cond: (a.person_id = people.id)
        ->  Seq Scan on person_delimiters a  (cost=0.00..5758538.04 rows=329501504 width=32) (actual time=0.008..30865.854 rows=329501518 loops=1)
        ->  Hash  (cost=45.46..45.46 rows=1000 width=16) (actual time=0.384..0.385 rows=1000 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 55kB
              ->  Limit  (cost=0.00..35.46 rows=1000 width=16) (actual time=0.004..0.241 rows=1000 loops=1)
                    ->  Seq Scan on people  (cost=0.00..4888821.40 rows=137873840 width=16) (actual time=0.003..0.158 rows=1000 loops=1)
  ->  Hash  (cost=24508.85..24508.85 rows=71385 width=2330) (actual time=839.841..839.841 rows=1227076 loops=1)
        Buckets: 2048  Batches: 64  Memory Usage: 3015kB
        ->  Seq Scan on delimiters  (cost=0.00..24508.85 rows=71385 width=2330) (actual time=0.007..303.814 rows=1227076 loops=1)
Planning Time: 1.197 ms
Execution Time: 66355.110 ms
Indexes:
-- person_delimiters --
public | person_delimiters | person_delimiters_pkey | | CREATE UNIQUE INDEX person_delimiters_pkey ON public.person_delimiters USING btree (person_id, delimiter_id)
public | person_delimiters | idx_person_delimiters_delimiter_id_person_id | | CREATE INDEX idx_person_delimiters_delimiter_id_person_id ON public.person_delimiters USING btree (delimiter_id, person_id)
-- people --
public | people | people_pkey | | CREATE UNIQUE INDEX people_pkey ON public.people USING btree (id)
-- delimiters --
public | delimiters | delimiters_pkey | | CREATE UNIQUE INDEX delimiters_pkey ON public.delimiters USING btree (id)
Anything I could work on to optimize it?
It normally works best to start from the smallest possible table/dataset and work up from there. The following is probably the most efficient way of approaching the problem in SQL (the second CTE is really there to clarify the approach; you could probably move its logic into the main SELECT statement without affecting much).
If this doesn't improve performance significantly, then you are probably looking at indexing, partitioning, or temporary-table solutions.
WITH person_id_list AS
(
    SELECT id FROM people LIMIT 1000
),
person_delim_list AS
(
    SELECT pd.delimiter_id, pd.person_id
    FROM person_id_list pil
    INNER JOIN person_delimiters pd ON pil.id = pd.person_id
)
SELECT d.*, pdl.person_id
FROM person_delim_list pdl
INNER JOIN delimiters d ON pdl.delimiter_id = d.id;
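The temporary-table route mentioned above can be sketched as follows (illustrative only, assuming the schema shown in the question). Running ANALYZE matters here: the plan above estimates rows=71385 for delimiters against 1227076 actual, which suggests stale statistics, and fresh statistics on the small id table help the planner choose index lookups instead of full scans.

```sql
-- Sketch of the temporary-table approach (not tested against the real data)
CREATE TEMPORARY TABLE person_id_list (id bigint PRIMARY KEY);

-- In the real application, insert the specific set of ~1000 person ids here
INSERT INTO person_id_list SELECT id FROM people LIMIT 1000;

ANALYZE person_id_list;  -- give the planner accurate row counts for the small set
ANALYZE delimiters;      -- the plan above shows a large misestimate on this table

SELECT d.*, pd.person_id
FROM person_id_list pil
INNER JOIN person_delimiters pd ON pd.person_id = pil.id
INNER JOIN delimiters d ON d.id = pd.delimiter_id;
```

With accurate statistics, the planner can drive nested-loop index scans via person_delimiters_pkey (person_id, delimiter_id) and delimiters_pkey, touching only the ~1000 ids instead of sequentially scanning the 329M-row join table.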