I'm having some performance trouble with PostgreSQL 9.3 on a Windows server.
CONTEXT:
I have a table containing hourly information for each equipment, for a given scenario, for a given simulation.
Table Creation + Index
This is the query used for the table creation (with the index):
CREATE TABLE equipment
(
simulation_id integer,
name text,
scenario integer,
hour integer,
data_1 real,
data_2 real,
data_3 real
)
WITH (
OIDS=FALSE
);
ALTER TABLE equipment
OWNER TO postgres;
CREATE INDEX equipment_ix_study_name_case_number_uid_scenario
ON equipment
USING btree
(simulation_id, name COLLATE pg_catalog."default", scenario);
GOAL:
For a specific simulation, I need to be able to get for one equipment the information for each hour and the previous hour grouped by equipment name and hour (so I must have aggregated data without the scenario dimension).
QUERY:
This is my query:
SELECT
TRIM( TRIM('"' from equipment.name) )
,equipment.hour
,CAST(
AVG(COALESCE(equipment.data_1, 0))
AS NUMERIC
)
,CAST(
SUM(CASE equipment.data_3 WHEN 1 THEN 1 ELSE 0 END)
AS INTEGER
)
,CAST(
SUM(CASE equipment.data_3 WHEN 2 THEN 1 ELSE 0 END)
AS INTEGER
)
,CAST(
SUM(CASE WHEN equipment.data_1 = 0 AND prev.data_1 <> 0 AND prev.data_3 = 0 AND equipment.data_3 = 0 THEN 1 ELSE 0 END)
AS INTEGER
)
,CAST(
SUM(CASE WHEN equipment.data_1 <> 0 AND prev.data_1 = 0 AND prev.data_3 = 0 AND equipment.data_3 = 0 THEN 1 ELSE 0 END)
AS INTEGER
)
FROM equipment
LEFT JOIN equipment AS prev
ON (equipment.scenario = prev.scenario
AND equipment.name = prev.name
AND equipment.hour-1 = prev.hour
AND equipment.simulation_id = prev.simulation_id)
WHERE equipment.simulation_id = p_study_id
GROUP BY equipment.name, equipment.hour
;
TEST:
I have a simulation with 1,294 equipment names, one scenario, and 8,736 hours (so 11,304,384 rows). The table also contains some other data, but only for one or two equipment names, so it is marginal.
The query shown above takes more than 16 minutes! The query plan is the following:
"GroupAggregate (cost=5812882.98..6384397.56 rows=1128513 width=29)"
" -> Sort (cost=5812882.98..5842752.93 rows=11947982 width=29)"
" Sort Key: equipment.name, equipment.hour"
" -> Merge Left Join (cost=3654539.47..4122515.94 rows=11947982 width=29)"
" Merge Cond: ((equipment.scenario = prev.scenario) AND (equipment.name = prev.name) AND (((equipment.hour - 1)) = prev.hour))"
" Join Filter: (equipment.simulation_id = prev.simulation_id)"
" -> Sort (cost=1827269.74..1855482.56 rows=11285128 width=29)"
" Sort Key: equipment.scenario, equipment.name, ((equipment.hour - 1))"
" -> Seq Scan on equipment (cost=0.00..235326.89 rows=11285128 width=29)"
" Filter: (simulation_id = 40)"
" -> Materialize (cost=1827269.74..1883695.38 rows=11285128 width=29)"
" -> Sort (cost=1827269.74..1855482.56 rows=11285128 width=29)"
" Sort Key: prev.scenario, prev.name, prev.hour"
" -> Seq Scan on equipment prev (cost=0.00..235326.89 rows=11285128 width=29)"
" Filter: (simulation_id = 40)"
ADDED: EXPLAIN ANALYZE
"GroupAggregate (cost=5812882.98..6384397.56 rows=1128513 width=29) (actual time=912136.432..1039780.509 rows=11286912 loops=1)"
" -> Sort (cost=5812882.98..5842752.93 rows=11947982 width=29) (actual time=912136.317..933297.923 rows=11286912 loops=1)"
" Sort Key: equipment.name, equipment.hour"
" Sort Method: external sort Disk: 463168kB"
" -> Merge Left Join (cost=3654539.47..4122515.94 rows=11947982 width=29) (actual time=424748.762..747696.137 rows=11286912 loops=1)"
" Merge Cond: ((equipment.scenario = prev.scenario) AND (equipment.name = prev.name) AND (((equipment.hour - 1)) = prev.hour))"
" Join Filter: (equipment.study_id = prev.study_id)"
" -> Sort (cost=1827269.74..1855482.56 rows=11285128 width=29) (actual time=217975.391..319019.577 rows=11286912 loops=1)"
" Sort Key: equipment.scenario, equipment.name, ((equipment.hour - 1))"
" Sort Method: external merge Disk: 507040kB"
" -> Seq Scan on equipment (cost=0.00..235326.89 rows=11285128 width=29) (actual time=0.031..26634.408 rows=11286912 loops=1)"
" Filter: (study_id = 40)"
" Rows Removed by Filter: 8736"
" -> Materialize (cost=1827269.74..1883695.38 rows=11285128 width=29) (actual time=206773.352..342103.150 rows=11286912 loops=1)"
" -> Sort (cost=1827269.74..1855482.56 rows=11285128 width=29) (actual time=206773.343..304007.437 rows=11286912 loops=1)"
" Sort Key: prev.scenario, prev.name, prev.hour"
" Sort Method: external merge Disk: 463000kB"
" -> Seq Scan on equipment prev (cost=0.00..235326.89 rows=11285128 width=29) (actual time=0.027..24442.187 rows=11286912 loops=1)"
" Filter: (study_id = 40)"
" Rows Removed by Filter: 8736"
"Total runtime: 1058464.697 ms"
Can someone explain to me why the data is sorted, and why it takes so much time? Is there any possible improvement in order to get acceptable performance?
Thanks in advance!
Postgres chooses to sort the table (twice) in order to perform your LEFT JOIN
via a merge join algorithm. Then it sorts the joined result in order to form your groups. Each sort takes a long time because you have 11M+ rows to sort, and because each sort is done externally on disk instead of in memory.
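Since all three sorts spilled roughly 450-500 MB to disk ("Sort Method: external merge Disk"), one quick mitigation worth trying, assuming the server has spare RAM, is raising work_mem for the session running this query. Note that work_mem applies per sort operation, and the 512MB value below is purely illustrative, not a sizing recommendation:

```sql
-- Hypothetical session-level tuning sketch: a larger work_mem may let the
-- ~500 MB sorts complete in memory instead of spilling to disk.
SET work_mem = '512MB';
-- ... run the reporting query here ...
RESET work_mem;
```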
Suggestions:
- Add hour to your index and make it a UNIQUE index. It is to be hoped that this will avoid the merge join and its two associated full-table sorts, in favor of something faster, such as a hash join using the index.
- Move scenario to be the first or second field of your index, so that when you have multiple scenarios, the pairs of rows you want to join are indexed close together.
- Change equipment.hour-1 = prev.hour to equipment.hour = prev.hour+1, to possibly help the query optimizer choose a strategy that avoids the need for the third sort (because the grouping columns then exactly match columns in the (equi-)join condition).
- Run the VACUUM ANALYZE command if you are not relying on auto-vacuuming.
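Putting the index and join-condition suggestions together, a sketch of the revised DDL and join might look like this (the index name is illustrative):

```sql
-- Unique index covering the lookup and join columns; scenario is placed
-- early so rows from the same scenario cluster together in the index.
CREATE UNIQUE INDEX equipment_ix_sim_scenario_name_hour
    ON equipment (simulation_id, scenario, name, hour);

-- Rewritten join condition: prev.hour + 1 leaves equipment.hour
-- untouched, so the join columns line up with the GROUP BY columns.
FROM equipment
LEFT JOIN equipment AS prev
    ON (equipment.simulation_id = prev.simulation_id
    AND equipment.scenario = prev.scenario
    AND equipment.name = prev.name
    AND equipment.hour = prev.hour + 1)
```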
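A different approach, not mentioned above, is to drop the self-join entirely and fetch the previous hour's values with the LAG() window function (available in 9.3), which reads the table only once. This is a sketch, and it assumes hours are consecutive within each (scenario, name) with no gaps; otherwise LAG returns the previous existing row rather than hour-1. The aliases prev_data_1/prev_data_3 are made up for illustration, and only two of the output columns are shown:

```sql
SELECT
    TRIM(TRIM('"' FROM name)),
    hour,
    CAST(AVG(COALESCE(data_1, 0)) AS NUMERIC),
    CAST(SUM(CASE WHEN data_1 = 0 AND prev_data_1 <> 0
                   AND prev_data_3 = 0 AND data_3 = 0
              THEN 1 ELSE 0 END) AS INTEGER)
FROM (
    -- LAG() pulls the previous hour's values within each scenario/name,
    -- replacing the self-join with a single pass over the table.
    SELECT name, hour, data_1, data_3,
           LAG(data_1) OVER w AS prev_data_1,
           LAG(data_3) OVER w AS prev_data_3
    FROM equipment
    WHERE simulation_id = 40
    WINDOW w AS (PARTITION BY scenario, name ORDER BY hour)
) AS sub
GROUP BY name, hour;
```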