Optimizing ST_Intersects in PostgreSQL (PostGIS)

Question

The query below takes almost 15 minutes to return a result, and I am wondering why. Is it the amount of data, or the number of vertices in the geometries? When I run the same query against a different table (loaded from a small shapefile), it runs fast.

Here's the query (thanks to Patrick for this):

WITH hi AS (
  SELECT ps.id, ps.brgy_locat, ps.municipali
  FROM evidensapp_polystructures ps
  JOIN evidensapp_seniangcbr fh ON fh.hazard = 'High'
                                 AND ST_Intersects(fh.geom, ps.geom)
), med AS (
  SELECT ps.id, ps.brgy_locat, ps.municipali
  FROM evidensapp_polystructures ps
  JOIN evidensapp_seniangcbr fh ON fh.hazard = 'Medium'
                                 AND ST_Intersects(fh.geom, ps.geom)
  EXCEPT SELECT * FROM hi
), low AS (
  SELECT ps.id, ps.brgy_locat, ps.municipali
  FROM evidensapp_polystructures ps
  JOIN evidensapp_seniangcbr fh ON fh.hazard = 'Low'
                                 AND ST_Intersects(fh.geom, ps.geom)
  EXCEPT SELECT * FROM hi
  EXCEPT SELECT * FROM med
)
SELECT brgy_locat AS barangay, municipali AS municipality, high, medium, low
FROM (SELECT brgy_locat, municipali, count(*) AS high
      FROM hi
      GROUP BY 1, 2) cnt_hi
FULL JOIN (SELECT brgy_locat, municipali, count(*) AS medium
      FROM med
      GROUP BY 1, 2) cnt_med USING (brgy_locat, municipali)
FULL JOIN (SELECT brgy_locat, municipali, count(*) AS low
      FROM low
      GROUP BY 1, 2) cnt_low USING (brgy_locat, municipali);

PostgreSQL 9.3, PostGIS 2.1.5

Table Polystructures contains 9847 rows:

CREATE TABLE evidensapp_polystructures (
  id serial NOT NULL PRIMARY KEY,
  bldg_name character varying(100) NOT NULL,
  bldg_type character varying(50) NOT NULL,
  brgy_locat character varying(50) NOT NULL,
  municipali character varying(50) NOT NULL,
  province character varying(50) NOT NULL,
  geom geometry(MultiPolygon,32651)
);

CREATE INDEX evidensapp_polystructures_geom_id
  ON evidensapp_polystructures USING gist (geom);
ALTER TABLE evidensapp_polystructures CLUSTER ON evidensapp_polystructures_geom_id;

Table SeniangCBR has only 6 rows; shapefile size (if it matters): 52,060 KB

CREATE TABLE evidensapp_seniangcbr (
  id serial NOT NULL PRIMARY KEY,
  hazard character varying(16) NOT NULL,
  geom geometry(MultiPolygon,32651)
);

CREATE INDEX evidensapp_seniangcbr_geom_id ON evidensapp_seniangcbr USING gist (geom);
ALTER TABLE evidensapp_seniangcbr CLUSTER ON evidensapp_seniangcbr_geom_id;

All the data were loaded into the database automatically with the LayerMapping utility, as I am using Django (GeoDjango).

EXPLAIN ANALYZE LINK HERE.

I don't have a server right now; I run the query on my PC.

Processor: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz (8 CPUs), ~3.6GHz
Memory: 8192MB RAM
OS: Windows 7 64-bit

Answer 1

The EXPLAIN ANALYZE output is hard to read because all the field and function names are anonymized into radio-alphabet words. That said, two things stand out:

Most time is spent in the ST_Intersects() function and this is not surprising.
The EXCEPT clause appears to be rather inefficient too.

So please try this rather less verbose version:

SELECT brgy_locat AS barangay, municipali AS municipality,
       sum(CASE max_hz_id WHEN 3 THEN 1 ELSE 0 END) AS high,
       sum(CASE max_hz_id WHEN 2 THEN 1 ELSE 0 END) AS medium,
       sum(CASE max_hz_id WHEN 1 THEN 1 ELSE 0 END) AS low
FROM (
  SELECT ps.id, ps.brgy_locat, ps.municipali,
         max(CASE fh.hazard WHEN 'Low' THEN 1 WHEN 'Medium' THEN 2 WHEN 'High' THEN 3 END) AS max_hz_id
  FROM evidensapp_polystructures ps
  JOIN evidensapp_seniangcbr fh ON ST_Intersects(fh.geom, ps.geom)
  GROUP BY 1, 2, 3
) AS ps_fh
GROUP BY 1, 2;

There is now only a single call to ST_Intersects(), which is possibly (hopefully) quite a bit faster than three calls on subsets of the hazard map (due to internal efficiencies in the PostGIS code).

The hazard class string is converted into integers, which allow easy ordering and comparison. The inner query selects the maximum hazard value per structure, corresponding to your requirement. The main query then counts the structures per maximum value into their respective columns.

If at all possible, change your table structure to use those three integer codes and link to a helper table for the class label: your table would get smaller and therefore faster, and the CASE expression in the inner query could be dropped. Alternatively, add a column with the integer code and update its values according to the "hazard" column.
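
Adding an integer code column could look roughly like this. It is a minimal sketch: the column name hazard_id is my assumption and not part of your schema, and since Django manages the table you may prefer to add the field to the model instead.

```sql
-- Hypothetical column name: hazard_id (not in the original schema).
ALTER TABLE evidensapp_seniangcbr ADD COLUMN hazard_id smallint;

-- Same mapping as the CASE expression in the inner query above.
UPDATE evidensapp_seniangcbr
SET    hazard_id = CASE hazard
                     WHEN 'Low'    THEN 1
                     WHEN 'Medium' THEN 2
                     WHEN 'High'   THEN 3
                   END;
```

The inner query could then use max(fh.hazard_id) in place of the CASE expression.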

Note that these CASE expressions are not very efficient (which is why I used the EXCEPT clause in my previous answer). PostgreSQL 9.4 introduces a FILTER clause on aggregate functions, which would make the query faster and easier to read:

count(id) FILTER (WHERE max_hz_id = 3) AS high

You might want to consider an upgrade.
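
On 9.4 or later, the complete query might then look like this. It is a sketch, untested on your data, and it only swaps the outer CASE sums for FILTER; the inner query is unchanged.

```sql
-- Requires PostgreSQL 9.4 or later (aggregate FILTER clause).
SELECT brgy_locat AS barangay, municipali AS municipality,
       count(id) FILTER (WHERE max_hz_id = 3) AS high,
       count(id) FILTER (WHERE max_hz_id = 2) AS medium,
       count(id) FILTER (WHERE max_hz_id = 1) AS low
FROM (
  -- One row per structure with its highest intersecting hazard level.
  SELECT ps.id, ps.brgy_locat, ps.municipali,
         max(CASE fh.hazard WHEN 'Low' THEN 1 WHEN 'Medium' THEN 2 WHEN 'High' THEN 3 END) AS max_hz_id
  FROM evidensapp_polystructures ps
  JOIN evidensapp_seniangcbr fh ON ST_Intersects(fh.geom, ps.geom)
  GROUP BY 1, 2, 3
) AS ps_fh
GROUP BY 1, 2;
```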

Greetings from Manila

Answer 2

Add a bounding_box geometry(Polygon,32651) column to each table, using the same SRID as the geom column. The value of the column would be the bounding box (min and max x, y of the multipolygon) that completely encloses the multipolygon.

Then your query would look like this:

AND ST_Intersects(fh.bounding_box, ps.bounding_box)
AND ST_Intersects(fh.geom, ps.geom)

The advantage of this is that the first ST_Intersects call is pretty fast. If it returns false, the second, more involved ST_Intersects call is never invoked, saving you some time in that case.
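
A minimal sketch of how such a column could be added and populated, assuming ST_Envelope() is used to compute the box and that the column uses the same SRID as geom (32651 in the posted schema):

```sql
-- Add the column to both tables, with the same SRID as geom.
ALTER TABLE evidensapp_polystructures ADD COLUMN bounding_box geometry(Polygon,32651);
ALTER TABLE evidensapp_seniangcbr     ADD COLUMN bounding_box geometry(Polygon,32651);

-- ST_Envelope() returns the bounding box of a geometry as a polygon
-- (for non-degenerate input such as these multipolygons).
UPDATE evidensapp_polystructures SET bounding_box = ST_Envelope(geom);
UPDATE evidensapp_seniangcbr     SET bounding_box = ST_Envelope(geom);
```

The column would have to be refreshed whenever geom changes.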

Answer 3

Similar to what I suggested and explained under your related question, I would use UNION ALL instead of FULL JOIN in the outer SELECT.

WITH hi AS (
   SELECT ps.brgy_locat, ps.municipali, fh.hazard, count(*) AS ct
   FROM   evidensapp_seniangcbr     fh
   JOIN   evidensapp_polystructures ps ON ST_Intersects(fh.geom, ps.geom)
   WHERE  fh.hazard = 'High'
   GROUP  BY 1, 2, 3
   )
, med AS (
   SELECT ps.brgy_locat, ps.municipali, fh.hazard, count(*) AS ct
   FROM   evidensapp_seniangcbr     fh
   JOIN   evidensapp_polystructures ps ON ST_Intersects(fh.geom, ps.geom)
   LEFT   JOIN hi USING (brgy_locat, municipali)
   WHERE  fh.hazard = 'Medium'
   AND    hi.brgy_locat IS NULL
   GROUP  BY 1, 2, 3
   )
TABLE hi

UNION ALL
TABLE med

UNION ALL
   SELECT ps.brgy_locat, ps.municipali, fh.hazard, count(*) AS ct
   FROM   evidensapp_seniangcbr     fh
   JOIN   evidensapp_polystructures ps ON ST_Intersects(fh.geom, ps.geom)
   LEFT   JOIN hi  USING (brgy_locat, municipali)
   LEFT   JOIN med USING (brgy_locat, municipali)
   WHERE  fh.hazard = 'Low'
   AND    hi.brgy_locat IS NULL
   AND    med.brgy_locat IS NULL
   GROUP BY 1, 2, 3;

This only considers the highest hazard level for each set of rows with identical (brgy_locat, municipali). Only rows that actually intersect with any row of relevant hazard level in evidensapp_seniangcbr are in the result. Also, the count only counts the rows that actually intersect. There may be more rows with the same (brgy_locat, municipali) in evidensapp_polystructures, just not intersecting with the same hazard level and therefore ignored.

Pick one of the standard methods to exclude, in the lower levels, rows for which you already found a match in a higher hazard level:

Select rows which are not present in other table

LEFT JOIN / IS NULL should use the index on id and perform very well here. Certainly faster than using EXCEPT based on the whole row, which cannot use an index.
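
As a sketch of the LEFT JOIN / IS NULL pattern at the level of individual structures: this variant joins on id rather than on (brgy_locat, municipali), which is an assumption about what you want to exclude, and it is untested on your data.

```sql
WITH hi AS (
   -- Structures that intersect a High hazard polygon.
   SELECT DISTINCT ps.id
   FROM   evidensapp_polystructures ps
   JOIN   evidensapp_seniangcbr fh ON ST_Intersects(fh.geom, ps.geom)
   WHERE  fh.hazard = 'High'
   )
-- Structures that intersect a Medium polygon but have no match in hi.
SELECT DISTINCT ps.id, ps.brgy_locat, ps.municipali
FROM   evidensapp_polystructures ps
JOIN   evidensapp_seniangcbr fh ON ST_Intersects(fh.geom, ps.geom)
LEFT   JOIN hi ON hi.id = ps.id
WHERE  fh.hazard = 'Medium'
AND    hi.id IS NULL;
```

DISTINCT is there because a structure can intersect more than one polygon of the same hazard level.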

Index

You do not need to add a bounding_box geometry column to your table as another answer suggested. In modern versions, PostGIS automatically uses an (index-backed) bounding box comparison. The PostGIS documentation:

This function call will automatically include a bounding box comparison that will make use of any indexes that are available on the geometries.

In fact, we already see index scans in the explain output you posted.
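
To illustrate: in PostGIS 2.x, ST_Intersects() is, as far as I know, an SQL wrapper that gets inlined into a bounding-box operator plus the exact test. The join condition therefore behaves roughly as if it were written like this. This is a sketch for illustration only; _ST_Intersects() is the internal, non-indexed function, and you would not normally call it directly.

```sql
SELECT ps.id
FROM   evidensapp_seniangcbr     fh
JOIN   evidensapp_polystructures ps
       ON  fh.geom && ps.geom                 -- bounding boxes overlap; can use the GiST index
       AND _ST_Intersects(fh.geom, ps.geom);  -- exact intersection test
```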

Your existing GiST index evidensapp_polystructures_geom_id should make the query fast.
Aside: the name of the index should probably be evidensapp_polystructures_geom_idx.

In addition, create an index on (brgy_locat, municipali) if you don't have one yet:

CREATE INDEX foo_idx ON evidensapp_polystructures (brgy_locat, municipali);

Alternative with `LATERAL` join

Since you have only 6 rows in evidensapp_seniangcbr, LATERAL joins may be faster:

WITH hi AS (
   SELECT ps.brgy_locat, ps.municipali, fh.hazard, count(*) AS ct
   FROM   evidensapp_seniangcbr fh
        , LATERAL (
      SELECT ps.brgy_locat, ps.municipali
      FROM   evidensapp_polystructures ps
      WHERE  ST_Intersects(fh.geom, ps.geom)
      ) ps
   WHERE  fh.hazard = 'High'
   GROUP  BY 1, 2, 3
   )
, med AS (
   SELECT ps.brgy_locat, ps.municipali, fh.hazard, count(*) AS ct
   FROM   evidensapp_seniangcbr fh
        , LATERAL (
      SELECT ps.brgy_locat, ps.municipali
      FROM   evidensapp_polystructures ps
      LEFT   JOIN hi USING (brgy_locat, municipali)
      WHERE  hi.brgy_locat IS NULL
      AND    ST_Intersects(fh.geom, ps.geom)
      ) ps
   WHERE  fh.hazard = 'Medium'
   GROUP  BY 1, 2, 3
   )
TABLE hi

UNION ALL
TABLE med

UNION ALL
   SELECT ps.brgy_locat, ps.municipali, fh.hazard, count(*) AS ct
   FROM   evidensapp_seniangcbr fh
        , LATERAL (
      SELECT ps.id, ps.brgy_locat, ps.municipali
      FROM   evidensapp_polystructures ps
      LEFT   JOIN hi  USING (brgy_locat, municipali)
      LEFT   JOIN med USING (brgy_locat, municipali)
      WHERE  hi.brgy_locat IS NULL
      AND    med.brgy_locat IS NULL
      AND    ST_Intersects(fh.geom, ps.geom)
      ) ps
   WHERE  fh.hazard = 'Low'
   GROUP  BY 1, 2, 3;

About LATERAL joins:

What is the difference between LATERAL and a subquery in PostgreSQL?

3 answers

solution1 2 2015-08-04 04:36:06

solution2 1 2015-08-16 19:05:27

solution3 1 ACCEPTED 2015-08-18 02:16:31
