Attempt to avoid duplicated code with CTE takes far too long
I need to create a view with some calculated and aggregated values. So I need certain values multiple times, like total_dist_pts in the example below.
There is a loc_a_run table with about 350 rows so far (constantly growing) and a loc_a_pline table with somewhat more than 4 million rows (also growing):
prod=# \d loc_a_run
Table "public.loc_a_run"
Column | Type | Collation | Nullable | Default
----------------+--------------------------+-----------+----------+-----------------------------
loc_a_run_id | integer | | not null | generated always as identity
execution_time | timestamp with time zone | | not null | CURRENT_TIMESTAMP(0)
run_type_id | smallint | | not null |
...
has_errors | boolean | | |
Indexes:
"loc_a_run_pkey" PRIMARY KEY, btree (loc_a_run_id)
prod=# \d loc_a_pline
Table "public.loc_a_pline"
Column | Type | Collation | Nullable | Default
----------------+-----------+-----------+----------+-----------------------------
loc_a_pline_id | bigint | | not null | generated always as identity
loc_a_run_id | integer | | not null |
is_top | boolean | | not null |
...
dist_left | numeric | | not null |
dist_right | numeric | | not null |
...
Indexes:
"loc_a_pline_pkey" PRIMARY KEY, btree (loc_a_pline_id)
Foreign-key constraints:
"loc_a_pline_loc_a_run_id_fkey" FOREIGN KEY (loc_a_run_id) REFERENCES loc_a_run(loc_a_run_id) ON UPDATE CASCADE ON DELETE CASCADE
The solution I use right now:
SELECT run.loc_a_run_id AS run_id
, run_type.run_type
, SUM(
CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END)
AS total_dist_pts
, COUNT(pline.loc_a_pline_id) AS total_plines
, SUM(
CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END)
/ COUNT(pline.loc_a_pline_id)
AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN loc_a_pline AS pline USING (loc_a_run_id)
JOIN run_type USING (run_type_id)
WHERE run.has_errors IS false
GROUP BY run_id, run_type;
Query plan:
"Finalize GroupAggregate (cost=154201.17..154577.71 rows=1365 width=108)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Gather Merge (cost=154201.17..154519.69 rows=2730 width=76)"
" Workers Planned: 2"
" -> Sort (cost=153201.15..153204.56 rows=1365 width=76)"
" Sort Key: run.loc_a_run_id, run_type.run_type"
" -> Partial HashAggregate (cost=153113.01..153130.07 rows=1365 width=76)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Hash Join (cost=21.67..120633.75 rows=1623963 width=62)"
" Hash Cond: (run.run_type_id = run_type.run_type_id)"
" -> Hash Join (cost=20.55..112756.41 rows=1623963 width=32)"
" Hash Cond: (pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline pline (cost=0.00..107766.55 rows=1867855 width=30)"
" -> Hash (cost=17.14..17.14 rows=273 width=6)"
" -> Seq Scan on loc_a_run run (cost=0.00..17.14 rows=273 width=6)"
" Filter: (has_errors IS FALSE)"
" -> Hash (cost=1.05..1.05 rows=5 width=34)"
" -> Seq Scan on loc_a_run_type run_type (cost=0.00..1.05 rows=5 width=34)"
This takes around 14.2s to execute. I lack the experience to assess how good or bad the performance is for this part, but I could live with it. Of course, faster would be an advantage.
Because this contains duplicated code, I tried to get rid of it by using a CTE (in the final view I need this for a few more calculations, but the pattern is the same):
WITH dist_pts AS (
SELECT loc_a_run_id
, CASE
WHEN is_top IS true
THEN ROUND(dist_right - dist_left, 2)
ELSE ROUND(dist_left - dist_right, 2)
END AS pts
FROM loc_a_pline
)
SELECT run.loc_a_run_id AS run_id
, run_type.run_type
, SUM(dist_pts.pts) AS total_dist_pts
, COUNT(pline.loc_a_pline_id) AS total_plines
, SUM(dist_pts.pts) / COUNT(pline.loc_a_pline_id) AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN dist_pts USING (loc_a_run_id)
JOIN loc_a_pline AS pline USING (loc_a_run_id)
JOIN run_type USING (run_type_id)
WHERE run.has_errors IS false
GROUP BY run_id, run_type;
Query plan:
"Finalize GroupAggregate (cost=575677889.59..575678266.13 rows=1365 width=108)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Gather Merge (cost=575677889.59..575678208.12 rows=2730 width=76)"
" Workers Planned: 2"
" -> Sort (cost=575676889.57..575676892.98 rows=1365 width=76)"
" Sort Key: run.loc_a_run_id, run_type.run_type"
" -> Partial HashAggregate (cost=575676801.43..575676818.49 rows=1365 width=76)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Parallel Hash Join (cost=155366.81..111024852.15 rows=23232597464 width=62)"
" Hash Cond: (loc_a_pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline (cost=0.00..107877.85 rows=1869785 width=22)"
" -> Parallel Hash (cost=120758.30..120758.30 rows=1625641 width=48)"
" -> Hash Join (cost=21.67..120758.30 rows=1625641 width=48)"
" Hash Cond: (run.run_type_id = run_type.run_type_id)"
" -> Hash Join (cost=20.55..112872.83 rows=1625641 width=18)"
" Hash Cond: (pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline pline (cost=0.00..107877.85 rows=1869785 width=12)"
" -> Hash (cost=17.14..17.14 rows=273 width=6)"
" -> Seq Scan on loc_a_run run (cost=0.00..17.14 rows=273 width=6)"
" Filter: (has_errors IS FALSE)"
" -> Hash (cost=1.05..1.05 rows=5 width=34)"
" -> Seq Scan on loc_a_run_type run_type (cost=0.00..1.05 rows=5 width=34)"
This takes forever and seems to be the wrong approach. I struggle to understand the query plan to find my mistake(s).
So my questions are:
Is there a way to SUM(dist_pts.pts) only once?
Is there a way to COUNT(pline.loc_a_pline_id) in the same go as the subtraction in the CTE, instead of accessing the big loc_a_pline table again? (Is it accessed again at all?)
Any help is highly appreciated.
Consider creating an index on loc_a_pline.loc_a_run_id. Postgres doesn't automatically create indexes on the referencing side of FK relationships. That should drastically improve the speed and remove the sequential scans over loc_a_pline from the execution plan.
Additionally, I'd suggest gathering all the data you want in the first portion of the CTE and then separating the aggregates out into their own portion. Something like this, which accesses all of the tables once and aggregates over the set once:
WITH dist_pts AS (
SELECT run.loc_a_run_id rid
, pline.loc_a_pline_id pid
, rt.run_type
, CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END AS pts
FROM loc_a_run run
JOIN loc_a_pline pline ON run.loc_a_run_id = pline.loc_a_run_id
JOIN run_type rt ON run.run_type_id = rt.run_type_id
WHERE run.has_errors IS FALSE
), aggs AS
(
SELECT SUM(dist_pts.pts) total_dist_pts
, COUNT(dist_pts.pid) total_plines
, dist_pts.rid
FROM dist_pts
GROUP BY dist_pts.rid
)
SELECT dp.rid
, dp.run_type
, aggs.total_dist_pts
, aggs.total_plines
, (aggs.total_dist_pts::DOUBLE PRECISION / aggs.total_plines::DOUBLE PRECISION) dist_pts_per_pline -- No integer division please
FROM dist_pts dp
JOIN aggs ON dp.rid = aggs.rid
GROUP BY dp.rid, dp.run_type
;
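Since run_type is determined by the run id, the second GROUP BY (and the join back to dist_pts) can arguably be avoided entirely by carrying run_type into the aggregation itself. A sketch of that simplification, untested against your real schema:

```sql
-- One pass over the joined set: run_type is grouped alongside the run id,
-- so the final SELECT needs no second join or grouping over dist_pts.
WITH dist_pts AS (
    SELECT run.loc_a_run_id AS rid
         , rt.run_type
         , CASE
               WHEN pline.is_top
               THEN ROUND(pline.dist_right - pline.dist_left, 2)
               ELSE ROUND(pline.dist_left - pline.dist_right, 2)
           END AS pts
    FROM loc_a_run run
    JOIN loc_a_pline pline ON run.loc_a_run_id = pline.loc_a_run_id
    JOIN run_type rt ON run.run_type_id = rt.run_type_id
    WHERE run.has_errors IS FALSE
)
SELECT rid
     , run_type
     , SUM(pts)            AS total_dist_pts
     , COUNT(*)            AS total_plines
     , SUM(pts) / COUNT(*) AS dist_pts_per_pline  -- numeric / bigint yields numeric, no integer division
FROM dist_pts
GROUP BY rid, run_type;
```

Because pts is numeric (ROUND over numeric columns), dividing by the bigint COUNT(*) already produces a numeric result, so no explicit cast is needed here.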