
GROUP BY on a self-joined table taking a long time

I'm having some performance trouble with PostgreSQL 9.3 on Windows Server.

CONTEXT:

I have a table containing hourly information for a piece of equipment, for a given scenario, in a given simulation.

  1. The equipment is represented by a name
  2. The hourly information covers a year (it could be a single day, but it's generally a year), so the data run from hour 1 to hour 8736.
  3. For the moment I have only one scenario, but in the future I will have several.
  4. The table could also hold data related to other simulations, but for now I only have one.

Table Creation + Index

These are the statements used to create the table and its index:

CREATE TABLE equipment
(
  simulation_id integer,
  name text,
  scenario integer,
  hour integer,
  data_1 real,
  data_2 real,
  data_3 real
)
WITH (
  OIDS=FALSE
);
ALTER TABLE equipment
  OWNER TO postgres;

CREATE INDEX equipment_ix_study_name_case_number_uid_scenario
  ON equipment
  USING btree
  (simulation_id, name COLLATE pg_catalog."default", scenario);

GOAL:

For a specific simulation, I need to retrieve, for each piece of equipment, the information for each hour together with that of the previous hour, grouped by equipment name and hour (so the data must be aggregated without the scenario dimension).

QUERY:

This is my query:

SELECT 
    TRIM( TRIM('"' from equipment.name) )
    ,equipment.hour
    ,CAST(
        AVG(COALESCE(equipment.data_1, 0)) 
        AS NUMERIC
    )
    ,CAST(
        SUM(CASE equipment.data_3 WHEN 1 THEN 1 ELSE 0 END) 
        AS INTEGER
    ) 
    ,CAST(
        SUM(CASE equipment.data_3 WHEN 2 THEN 1 ELSE 0 END) 
        AS INTEGER
    )
    ,CAST(
        SUM(CASE WHEN equipment.data_1 = 0 AND prev.data_1 <> 0 AND prev.data_3 = 0 AND equipment.data_3 = 0 THEN 1 ELSE 0 END) 
        AS INTEGER
    )
    ,CAST(
        SUM(CASE WHEN equipment.data_1 <> 0 AND prev.data_1 = 0 AND prev.data_3 = 0 AND equipment.data_3 = 0 THEN 1 ELSE 0 END) 
        AS INTEGER
    )
FROM equipment 
LEFT JOIN equipment AS prev 
    ON (equipment.scenario = prev.scenario 
        AND equipment.name = prev.name 
        AND equipment.hour-1 = prev.hour
        AND equipment.simulation_id = prev.simulation_id) 
WHERE equipment.simulation_id = p_study_id
GROUP BY equipment.name, equipment.hour
;

TEST:

I have a simulation with 1294 pieces of equipment, one scenario, and 8736 hours, i.e. 11,304,384 rows. The table also holds some other data (for only one or two pieces of equipment, so it is marginal).

The query shown above takes more than 16 minutes! The query plan is the following:

"GroupAggregate  (cost=5812882.98..6384397.56 rows=1128513 width=29)"
"  ->  Sort  (cost=5812882.98..5842752.93 rows=11947982 width=29)"
"        Sort Key: equipment.name, equipment.hour"
"        ->  Merge Left Join  (cost=3654539.47..4122515.94 rows=11947982 width=29)"
"              Merge Cond: ((equipment.scenario = prev.scenario) AND (equipment.name = prev.name) AND (((equipment.hour - 1)) = prev.hour))"
"              Join Filter: (equipment.simulation_id = prev.simulation_id)"
"              ->  Sort  (cost=1827269.74..1855482.56 rows=11285128 width=29)"
"                    Sort Key: equipment.scenario, equipment.name, ((equipment.hour - 1))"
"                    ->  Seq Scan on equipment  (cost=0.00..235326.89 rows=11285128 width=29)"
"                          Filter: (simulation_id = 40)"
"              ->  Materialize  (cost=1827269.74..1883695.38 rows=11285128 width=29)"
"                    ->  Sort  (cost=1827269.74..1855482.56 rows=11285128 width=29)"
"                          Sort Key: prev.scenario, prev.name, prev.hour"
"                          ->  Seq Scan on equipment prev  (cost=0.00..235326.89 rows=11285128 width=29)"
"                                Filter: (simulation_id = 40)"

ADDED: EXPLAIN ANALYZE

"GroupAggregate  (cost=5812882.98..6384397.56 rows=1128513 width=29) (actual time=912136.432..1039780.509 rows=11286912 loops=1)"
"  ->  Sort  (cost=5812882.98..5842752.93 rows=11947982 width=29) (actual time=912136.317..933297.923 rows=11286912 loops=1)"
"        Sort Key: equipment.name, equipment.hour"
"        Sort Method: external sort  Disk: 463168kB"
"        ->  Merge Left Join  (cost=3654539.47..4122515.94 rows=11947982 width=29) (actual time=424748.762..747696.137 rows=11286912 loops=1)"
"              Merge Cond: ((equipment.scenario = prev.scenario) AND (equipment.name = prev.name) AND (((equipment.hour - 1)) = prev.hour))"
"              Join Filter: (equipment.simulation_id = prev.simulation_id)"
"              ->  Sort  (cost=1827269.74..1855482.56 rows=11285128 width=29) (actual time=217975.391..319019.577 rows=11286912 loops=1)"
"                    Sort Key: equipment.scenario, equipment.name, ((equipment.hour - 1))"
"                    Sort Method: external merge  Disk: 507040kB"
"                    ->  Seq Scan on equipment  (cost=0.00..235326.89 rows=11285128 width=29) (actual time=0.031..26634.408 rows=11286912 loops=1)"
"                          Filter: (simulation_id = 40)"
"                          Rows Removed by Filter: 8736"
"              ->  Materialize  (cost=1827269.74..1883695.38 rows=11285128 width=29) (actual time=206773.352..342103.150 rows=11286912 loops=1)"
"                    ->  Sort  (cost=1827269.74..1855482.56 rows=11285128 width=29) (actual time=206773.343..304007.437 rows=11286912 loops=1)"
"                          Sort Key: prev.scenario, prev.name, prev.hour"
"                          Sort Method: external merge  Disk: 463000kB"
"                          ->  Seq Scan on equipment prev  (cost=0.00..235326.89 rows=11285128 width=29) (actual time=0.027..24442.187 rows=11286912 loops=1)"
"                                Filter: (simulation_id = 40)"
"                                Rows Removed by Filter: 8736"
"Total runtime: 1058464.697 ms"

Can someone explain to me why the data are sorted and why it takes so much time? Is there any possible improvement that would give acceptable performance?

Thanks in advance!

Postgres chooses to sort the table (twice) in order to perform your LEFT JOIN via a merge join algorithm. Then it sorts the joined result in order to form your groups. Each sort takes a long time because you have 11M+ rows to sort, and because each sort has to be done externally, on disk, instead of in memory (see the Sort Method lines in your EXPLAIN ANALYZE output).
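As an aside, the memory budget for each of those sorts is governed by the work_mem setting, and your output shows each sort spilling roughly 450-500 MB to disk, far beyond the 9.3 default of 1 MB. A minimal illustration, assuming you have the RAM to spare (the right value depends on how many sorts and sessions run concurrently):

SHOW work_mem;           -- only 1MB by default on 9.3
SET work_mem = '512MB';  -- session-local; a sort that fits in this budget stays in memory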

Suggestions:

  1. Split your table into two: one associating each distinct equipment name with a unique integer ID, and the other containing the simulation data. In the simulation data and its index, identify the equipment by its integer ID instead of by name. This takes advantage of the fact that integers are much faster to compare than strings, and they take less space too (see the first sketch after this list).
  2. Add hour to your index and make it a UNIQUE index. The hope is that this avoids the merge join and its two associated full-table sorts in favor of something faster, such as a hash join, or index scans that deliver the rows pre-sorted (see the index sketch below).
  3. Consider moving scenario to be the first or second field of your index, so that when you have multiple scenarios, the pairs of rows you want to join are indexed close together (the index sketch below does this too).
  4. Change the join criterion equipment.hour-1 = prev.hour to equipment.hour = prev.hour+1, which may help the query optimizer choose a strategy that avoids the third sort, because the grouping columns then exactly match plain columns in the (equi-)join condition (see the rewritten join below).
  5. Make sure the query optimizer has good database statistics to work with: ensure the auto-vacuum daemon is enabled and has run at least once, or run the VACUUM ANALYZE command manually if you don't want auto-vacuuming (shown last below).
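A minimal sketch of the split from suggestion 1 — the table and column names here are hypothetical, and you would adapt your query to join back through the lookup table wherever the text name is needed:

CREATE TABLE equipment_name
(
  id serial PRIMARY KEY,
  name text NOT NULL UNIQUE
);

CREATE TABLE equipment_data
(
  simulation_id integer,
  equipment_id integer REFERENCES equipment_name (id),
  scenario integer,
  hour integer,
  data_1 real,
  data_2 real,
  data_3 real
);

-- populate the lookup table from the existing data
INSERT INTO equipment_name (name)
SELECT DISTINCT name FROM equipment;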
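Suggestions 2 and 3 combined give an index like the following, shown here on the original single-table layout. Assuming each (simulation_id, scenario, name, hour) combination occurs exactly once in your data, the index can be UNIQUE:

CREATE UNIQUE INDEX equipment_ix_simulation_scenario_name_hour
  ON equipment
  USING btree
  (simulation_id, scenario, name COLLATE pg_catalog."default", hour);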
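With suggestion 4 applied, the join in your query would read:

FROM equipment
LEFT JOIN equipment AS prev
    ON (equipment.simulation_id = prev.simulation_id
        AND equipment.scenario = prev.scenario
        AND equipment.name = prev.name
        AND equipment.hour = prev.hour + 1)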
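And for suggestion 5, checking the daemon and refreshing the statistics by hand:

-- confirm the auto-vacuum daemon is enabled
SHOW autovacuum;

-- reclaim dead rows and rebuild planner statistics for this table
VACUUM ANALYZE equipment;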
