Improve query performance - do not use index scan

Question

I've got 2 tables:

event - list of events which should be processed (new files and dirs). Size: ~2M rows
dir_current - list of directories currently visible on filesystem. Size: ~1M but up to 100M in the future.

I use stored procedure to process events and turn them into dir_current rows. First step of processing events is to find all rows that do not have parent in dir_current table. Unfortunately this get a little more complicated as parent might be present in event table so we don't want to include them in result. I came up with this query:

SELECT DISTINCT event.parent_path, event.depth FROM sf.event as event
    LEFT OUTER JOIN sf.dir_current as dir ON
        event.parent_path = dir.path
        AND dir.volume_id = 1
    LEFT OUTER JOIN sf.event as event2 ON
        event.parent_path = event2.path
        AND event2.volume_id = 1
        AND event2.type = 'DIR'
        AND event2.id <= MAX_ID_VARIABLE
    WHERE
        event.volume_id = 1
        AND event.id <= MAX_ID_VARIABLE
        AND dir.volume_id IS NULL
        AND event2.id IS NULL
    ORDER BY depth, parent_path;

MAX_ID_VARIABLE is variable limiting number of events processed at once.

Below is explain analyze result ( explain.depesz.com ):

                                                                               QUERY PLAN                                                                                 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=395165.81..395165.82 rows=1 width=83) (actual time=32009.439..32049.675 rows=2462 loops=1)
   ->  Sort  (cost=395165.81..395165.81 rows=1 width=83) (actual time=32009.432..32021.733 rows=184975 loops=1)
         Sort Key: event.depth, event.parent_path
         Sort Method: quicksort  Memory: 38705kB
         ->  Nested Loop Anti Join  (cost=133385.93..395165.80 rows=1 width=83) (actual time=235.581..30916.912 rows=184975 loops=1)
               ->  Hash Anti Join  (cost=133385.38..395165.14 rows=1 width=83) (actual time=83.073..1703.618 rows=768278 loops=1)
                     Hash Cond: (event.parent_path = event2.path)
                     ->  Seq Scan on event  (cost=0.00..252872.92 rows=2375157 width=83) (actual time=0.014..756.014 rows=2000000 loops=1)
                           Filter: ((id <= 13000000) AND (volume_id = 1))
                     ->  Hash  (cost=132700.54..132700.54 rows=54787 width=103) (actual time=82.754..82.754 rows=48029 loops=1)
                           Buckets: 65536  Batches: 1  Memory Usage: 6696kB
                           ->  Bitmap Heap Scan on event event2  (cost=6196.07..132700.54 rows=54787 width=103) (actual time=12.979..63.803 rows=48029 loops=1)
                                 Recheck Cond: (type = '16384'::text)
                                 Filter: ((id <= 13000000) AND (volume_id = 1))
                                 Heap Blocks: exact=16465
                                 ->  Bitmap Index Scan on event_dir_depth_idx  (cost=0.00..6182.38 rows=54792 width=0) (actual time=8.759..8.759 rows=48029 loops=1)
               ->  Index Only Scan using dircurrent_volumeid_path_unq on dir_current dir  (cost=0.55..0.65 rows=1 width=115) (actual time=0.038..0.038 rows=1 loops=768278)
                     Index Cond: ((volume_id = 1) AND (path = event.parent_path))
                     Heap Fetches: 583027
 Planning time: 2.114 ms
 Execution time: 32054.498 ms

The slowest part is Index Only Scan on dir_current table (took 29 sec from 32 sec total).

I wonder why Postgres is using index scan instead of sequential scan which would take 2-3 seconds.

After setting:

SET enable_indexscan TO false;
SET enable_bitmapscan TO false;

I received query that runs in 3 sec explain.depesz.com :

                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=569654.93..569654.94 rows=1 width=83) (actual time=3943.487..3979.613 rows=2462 loops=1)
   ->  Sort  (cost=569654.93..569654.93 rows=1 width=83) (actual time=3943.481..3954.169 rows=184975 loops=1)
         Sort Key: event.depth, event.parent_path
         Sort Method: quicksort  Memory: 38705kB
         ->  Hash Anti Join  (cost=307875.14..569654.92 rows=1 width=83) (actual time=1393.185..2970.626 rows=184975 loops=1)
               Hash Cond: ((event.parent_path = dir.path) AND ((event.depth - 1) = dir.depth))
               ->  Hash Anti Join  (cost=259496.25..521276.01 rows=1 width=83) (actual time=786.617..2111.297 rows=768278 loops=1)
                     Hash Cond: (event.parent_path = event2.path)
                     ->  Seq Scan on event  (cost=0.00..252872.92 rows=2375157 width=83) (actual time=0.016..616.598 rows=2000000 loops=1)
                           Filter: ((id <= 13000000) AND (volume_id = 1))
                     ->  Hash  (cost=258811.41..258811.41 rows=54787 width=103) (actual time=786.214..786.214 rows=48029 loops=1)
                           Buckets: 65536  Batches: 1  Memory Usage: 6696kB
                           ->  Seq Scan on event event2  (cost=0.00..258811.41 rows=54787 width=103) (actual time=0.068..766.563 rows=48029 loops=1)
                                 Filter: ((id <= 13000000) AND (volume_id = 1) AND (type = '16384'::text))
                                 Rows Removed by Filter: 1951971
               ->  Hash  (cost=36960.95..36960.95 rows=761196 width=119) (actual time=582.430..582.430 rows=761196 loops=1)
                     Buckets: 1048576  Batches: 1  Memory Usage: 121605kB
                     ->  Seq Scan on dir_current dir  (cost=0.00..36960.95 rows=761196 width=119) (actual time=0.010..267.484 rows=761196 loops=1)
                           Filter: (volume_id = 1)
 Planning time: 2.242 ms
 Execution time: 3999.213 ms

Both tables were analyzed before running queries.

Any idea why is Postgres using far from optimal query plan? Is there a better way to improve query performance then disabling index/bitmap scan? Maybe different query with same result?

I am using Postgres 9.5.2 I would be grateful for any help.

Answer 1

You are only fetching columns from one table. I would recommend rewriting the query as:

SELECT e.parent_path, e.depth
FROM sf.event e
WHERE e.volume_id = 1 AND e.id <= MAX_ID_VARIABLE AND
      NOT EXISTS (SELECT 1
                  FROM dir_current dc
                  WHERE e.parent_path = dc.path AND dc.volume_id = 1
                 ) AND
      NOT EXISTS (SELECT 1
                  FROM sf.event e2 ON
                       e.parent_path = e2.path AND
                       e2.volume_id = 1 AND
                       e2.type = 'DIR' AND
                       e2.id <= MAX_ID_VARIABLE
                 )
ORDER BY e.depth, e.parent_path;

Then the following indexes:

event(volume_id, id)
dir_current(path, volume_id)
event(path, volume_id, type, id)

I'm not sure why there is a comparison to MAX_ID_VARIABLE . Without this comparison, the first index can include the sort keys: event(volume_id, depth, parent_path) .

Improve query performance - do not use index scan

Question

1 answers

solution1
2 2016-04-14 14:32:52

Improve query performance - do not use index scan

Question

1 answers

solution1 2 2016-04-14 14:32:52

solution1
2 2016-04-14 14:32:52