
Optimizing postgres delete query

Goal

I have a table matches and a table table2 such that matches.key and table2.key have a many-to-many relationship.

matches
-------
key (bigint), other columns...
---
1
2
1

table2
-------
key (bigint), createdAt (date), other columns...
---
1
2
2
1

I want to delete all "orphan" records in table2: records whose key does not exist in matches AND which were created more than 5 hours ago.

What We Did So Far

explain (analyse,buffers) delete from table2 as mo
      where not exists (select null from matches pf where pf.key=mo.key)
      and mo."createdAt" < now() - interval '5 hours'

I'm running the delete query every 5 seconds. I can change that if it's worth it.

It's working, but it's slow (600k records in table2 and 1k records in matches):

Delete on table2 mo  (cost=127.40..33648.30 rows=1 width=12) (actual time=248.302..248.305 rows=0 loops=1)
  Buffers: shared hit=9435 read=11203
  I/O Timings: read=23.365
  ->  Hash Anti Join  (cost=127.40..33648.30 rows=1 width=12) (actual time=248.300..248.302 rows=0 loops=1)
        Hash Cond: (mo."key" = pf."key")
        Buffers: shared hit=9435 read=11203
        I/O Timings: read=23.365
        ->  Seq Scan on table2 mo  (cost=0.00..30930.79 rows=296013 width=14) (actual time=0.037..196.845 rows=296970 loops=1)
              Filter: ("createdAt" < (now() - '05:00:00'::interval))
              Rows Removed by Filter: 297302
              Buffers: shared hit=9318 read=11203
              I/O Timings: read=23.365
        ->  Hash  (cost=121.62..121.62 rows=462 width=14) (actual time=0.461..0.462 rows=458 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 30kB
              Buffers: shared hit=117
              ->  Seq Scan on matches pf  (cost=0.00..121.62 rows=462 width=14) (actual time=0.046..0.343 rows=458 loops=1)
                    Buffers: shared hit=117
Planning:
  Buffers: shared hit=10 read=2
  I/O Timings: read=0.044
Planning Time: 0.702 ms
Execution Time: 248.396 ms

Performance/Data

  • matches table - will hold at most ~1k records over its lifetime.
  • table2 table - will grow by up to 20 million records on Saturdays (spread throughout the day). On all other days, the table receives at most 2 million new records.

To measure the performance of my query, I created a small script that inserts "old" and "new" records. "Old" records are expected to be deleted after every run; "new" records are expected to stay.

The script inserts 1k "old" and 1k "new" records per second (2k total), as illustrated below.
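As a rough illustration only (the actual script is not shown here, and the key values and column list are placeholders), the inserts could look something like:

-- Hypothetical benchmark inserts: 1k "old" rows (expected to be deleted on the
-- next run) and 1k "new" rows (expected to survive), once per second.
insert into table2 (key, "createdAt")
select g, now() - interval '6 hours' from generate_series(1, 1000) g;   -- "old"

insert into table2 (key, "createdAt")
select g, now() from generate_series(1, 1000) g;                        -- "new"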

I expect the query duration to increase as more "new" records accumulate in table2, but the initial duration is already slow and the growth rate is too high:

Prometheus:

  • initial duration (no data in table2): 0.06 seconds.
  • current duration (2,848,877 records in table2): 5-7+ seconds, and it keeps increasing...


Indexes

  • table2 table - multi-column index on (key, "createdAt"), in that order.
  • table2 table - clustered on the key index.
  • matches table - a single-column index on key.
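For reference, the DDL for those indexes presumably looks roughly like this (the index names here are guesses on my part):

create index table2_key_createdat_idx on table2 (key, "createdAt");
create index matches_key_idx on matches (key);
-- table2 is clustered on its key index, i.e. something like:
-- cluster table2 using table2_key_createdat_idx;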

More Info

  • key is bigint
  • createdAt is timestamp with time zone
  • postgres version: 13.2

Question

What can I do to improve the initial query duration and decrease the increase-rate?

First, something seems suspicious with the design...

  • Deleting every five seconds is suspicious.
  • Duplicates in the matches table are suspicious.

That suggests that there might be a better way to solve your overall problem -- but you don't explain what you are doing. For instance, you might want a trigger to do deletes.

You only explain the query that you have.
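If deletes from matches are what turn table2 rows into orphans, a trigger-based approach could look roughly like this (a sketch only; the names are assumed, and the 5-hour age condition would still need to be handled somewhere):

-- Sketch: when the last matches row for a key disappears, remove that key's
-- rows from table2 instead of polling for orphans every 5 seconds.
create or replace function delete_table2_orphans() returns trigger as $$
begin
    if not exists (select 1 from matches where key = old.key) then
        delete from table2 where key = old.key;
    end if;
    return old;
end;
$$ language plpgsql;

create trigger matches_delete_orphans
    after delete on matches
    for each row execute function delete_table2_orphans();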

In any case, one index that might help is an index on table2("createdAt"). You seem to have a pretty high insert volume if you need to run this every 5 seconds, which suggests that load on the server might also be an issue.
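Something along these lines (the index name is arbitrary; CONCURRENTLY avoids blocking your heavy insert traffic while the index is built):

-- Lets the planner skip all rows newer than 5 hours instead of seq-scanning table2.
create index concurrently table2_createdat_idx on table2 ("createdAt");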

If most rows are protected from deletion by being too new, then you can quickly rule those rows out with an index, scanning only the rows older than 5 hours. But if most rows are protected by having matches, there is no way to quickly rule those out with your current design: each protected row (older than 5 hours, if such an index is used) will need to be visited and assessed every 5 seconds.

Assuming you want to use this general design at all, perhaps you could partition the data. You could have one partition for vulnerable rows (with no matches) and another for matched rows. Then a trigger could move rows to the vulnerable partition upon deletion of rows from the matches table (if there are no remaining matches). A sketch of that idea follows.
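A minimal sketch, assuming a flag column that records whether the key still has a match (the column, table, and partition names are all invented here):

-- List-partition table2 so the periodic delete only ever scans the
-- (presumably small) partition of rows with no remaining match.
create table table2 (
    key         bigint not null,
    "createdAt" timestamptz not null,
    has_match   boolean not null default false
    -- other columns...
) partition by list (has_match);

create table table2_vulnerable partition of table2 for values in (false);
create table table2_matched    partition of table2 for values in (true);

-- The trigger on matches would flip has_match to false (moving the row to the
-- vulnerable partition), and the periodic job becomes a cheap range delete:
-- delete from table2_vulnerable where "createdAt" < now() - interval '5 hours';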
