
Next query after large delete is slow. Why does a SELECT trigger WAL file archiving?

What's going on here to cause this, please? Specifically:

  1. Why is the SELECT after the DELETE so much slower? I get that it now has to step over the dead tuples, but is that cost really this high?
  2. Why does the SELECT appear to cause log-file archiving? This is unexpected, as a SELECT shouldn't be generating any WAL.
  3. It appears as though the SELECT waits for log-file archiving before returning an answer. Why?

Environment:

  • PostgreSQL 14.2
  • log_min_messages=debug1 (so I can see archived-WAL activity in the log)

I have a streaming replica and a WAL archive directory.
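For reference, the relevant logging and archiving configuration can be inspected from psql; a minimal sketch, shown only to make the setup reproducible:

SHOW server_version;     -- 14.2 here
SHOW log_min_messages;   -- debug1 is what surfaces the "archived write-ahead log file" lines
SHOW archive_mode;       -- on: completed WAL segments are handed to the archiver
SHOW archive_command;    -- the command that copies each segment to the archive directory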

I create a simple table that will hold 50 million rows.

CREATE TABLE IF NOT EXISTS bigt
(
    a integer,
    b integer,
    des text
);

Now populate the table with some data (50 million rows should suffice):

insert into bigt select i,mod(i,4),md5( mod(i,4)::text) from generate_series(1,50000000) i;

Let's see how long it takes to query (i.e. sequentially scan) it:

select count(*) from bigt;
  count
----------
 50000000
(1 row)

Time: 1459.486 ms (00:01.459)

OK, so less than 1.5 seconds (note this gets faster, of course, on the 2nd and 3rd runs due to caching).
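A quick way to verify that caching effect is the same EXPLAIN form I use further down; on a warm cache most buffers show up as "shared hit" rather than "read":

explain (analyze, buffers) select count(*) from bigt;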

Now I will delete half the rows in that table then re-query the table count and watch what happens:

delete from bigt where a>25000000;
DELETE 25000000
Time: 41669.131 ms (00:41.669)
mydb=# select count(*) from bigt;
  count
----------
 25000000
(1 row)

Time: 22453.483 ms (00:22.453)

WOW, 22.5 seconds! I am tailing the PostgreSQL log file at the same time. After the DELETE statement, I wait 10 seconds before running the SELECT. The act of running the SELECT seems to cause a sequence of log lines like:

DEBUG:  archived write-ahead log file "0000000100000023000000F6"

After tens of these lines, the SELECT finally completes, 22.5 seconds after it started!
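To see the dead tuples the scan now has to step over, a query like this can be run between the DELETE and the SELECT; a sketch against the standard pg_stat_user_tables view (the counters are approximate):

select n_live_tup, n_dead_tup, last_autovacuum
from pg_stat_user_tables
where relname = 'bigt';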

UPDATE: Suspecting that the SELECT itself triggers something, I simplified the scenario and, to remove the possibility of autovacuum kicking in during the test, disabled autovacuum for this test table.

Start with an empty (truncated) table, same schema as before.

mydb=# ALTER TABLE bigt SET (autovacuum_enabled = off);
ALTER TABLE
Time: 2.436 ms
mydb=# truncate bigt;
TRUNCATE TABLE
Time: 196.077 ms

Insert 30 million rows

mydb=# insert into bigt select i,mod(i,4),md5( mod(i,4)::text) from generate_series(1,30000000) i;
INSERT 0 30000000
Time: 57994.379 ms (00:57.994)
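Between the load and the SELECT, a snapshot of the cluster-wide WAL statistics gives a baseline to compare against after the query; a sketch using the pg_stat_wal view available from v14 (run it once now and again after the SELECT, then compare):

select wal_records, wal_fpi, pg_size_pretty(wal_bytes) as wal_bytes
from pg_stat_wal;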

Wait a few seconds, then issue the SELECT (I am tailing the log, and this SELECT appears to be what triggers the archiving activity).

This time I ran it with the EXPLAIN (ANALYZE) output I would normally use:

db=# explain (analyze,buffers,verbose) select count(*) from bigt;
                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=384761.97..384761.98 rows=1 width=8) (actual time=22437.498..22463.793 rows=1 loops=1)
   Output: count(*)
   Buffers: shared read=280374 dirtied=280374 written=94611
   I/O Timings: read=59174.975 write=1503.423
   ->  Gather  (cost=384761.91..384761.96 rows=4 width=8) (actual time=22437.135..22463.785 rows=5 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 4
         Workers Launched: 4
         Buffers: shared read=280374 dirtied=280374 written=94611
         I/O Timings: read=59174.975 write=1503.423
         ->  Partial Aggregate  (cost=383761.91..383761.92 rows=1 width=8) (actual time=22410.083..22410.083 rows=1 loops=5)
               Output: PARTIAL count(*)
               Buffers: shared read=280374 dirtied=280374 written=94611
               I/O Timings: read=59174.975 write=1503.423
               Worker 0:  actual time=22403.414..22403.415 rows=1 loops=1
                 Buffers: shared read=55891 dirtied=55891 written=18963
                 I/O Timings: read=11813.538 write=323.569
               Worker 1:  actual time=22403.428..22403.429 rows=1 loops=1
                 Buffers: shared read=55196 dirtied=55196 written=18936
                 I/O Timings: read=11892.729 write=322.704
               Worker 2:  actual time=22403.400..22403.401 rows=1 loops=1
                 Buffers: shared read=55584 dirtied=55584 written=18621
                 I/O Timings: read=11842.255 write=317.909
               Worker 3:  actual time=22403.248..22403.249 rows=1 loops=1
                 Buffers: shared read=55424 dirtied=55424 written=18837
                 I/O Timings: read=11814.539 write=288.189
               ->  Parallel Seq Scan on public.bigt  (cost=0.00..363084.33 rows=8271033 width=0) (actual time=0.354..21911.602 rows=6000000 loops=5)
                     Output: a, b, des
                     Buffers: shared read=280374 dirtied=280374 written=94611
                     I/O Timings: read=59174.975 write=1503.423
                     Worker 0:  actual time=0.174..21909.475 rows=5980319 loops=1
                       Buffers: shared read=55891 dirtied=55891 written=18963
                       I/O Timings: read=11813.538 write=323.569
                     Worker 1:  actual time=0.526..21914.253 rows=5905972 loops=1
                       Buffers: shared read=55196 dirtied=55196 written=18936
                       I/O Timings: read=11892.729 write=322.704
                     Worker 2:  actual time=0.519..21908.078 rows=5947488 loops=1
                       Buffers: shared read=55584 dirtied=55584 written=18621
                       I/O Timings: read=11842.255 write=317.909
                     Worker 3:  actual time=0.525..21909.759 rows=5930368 loops=1
                       Buffers: shared read=55424 dirtied=55424 written=18837
                       I/O Timings: read=11814.539 write=288.189
 Query Identifier: -3522295412005428879
 Planning:
   Buffers: shared hit=7 read=4 dirtied=1
   I/O Timings: read=0.025
 Planning Time: 0.312 ms
 Execution Time: 22463.902 ms
(48 rows)

Time: 22465.325 ms (00:22.465)

The log-file extract below shows the SELECT and its impact on log-file archiving; the EXPLAIN SELECT seems to trigger lots of WAL-archiving activity.

2022-08-22 17:13:23.788 BST [26503]  DEBUG:  archived write-ahead log file "00000001000000260000004D"
2022-08-22 17:13:23.997 BST [26503]  DEBUG:  archived write-ahead log file "00000001000000260000004E"
2022-08-22 17:13:24.412 BST [26503]  DEBUG:  archived write-ahead log file "00000001000000260000004F"
2022-08-22 17:13:24.635 BST [26503]  DEBUG:  archived write-ahead log file "000000010000002600000050"
2022-08-22 17:13:25.037 BST [61194] usr LOG:  duration: 57994.310 ms
2022-08-22 17:13:30.715 BST [61194] usr LOG:  statement: explain (analyze,buffers,verbose) select count(*) from bigt;
2022-08-22 17:13:30.716 BST [26496]  DEBUG:  registering background worker "parallel worker for PID 61194"
2022-08-22 17:13:30.716 BST [26496]  DEBUG:  registering background worker "parallel worker for PID 61194"
2022-08-22 17:13:30.716 BST [26496]  DEBUG:  registering background worker "parallel worker for PID 61194"
2022-08-22 17:13:30.716 BST [26496]  DEBUG:  registering background worker "parallel worker for PID 61194"
2022-08-22 17:13:30.716 BST [26496]  DEBUG:  starting background worker process "parallel worker for PID 61194"
2022-08-22 17:13:30.716 BST [26496]  DEBUG:  starting background worker process "parallel worker for PID 61194"
2022-08-22 17:13:30.717 BST [26496]  DEBUG:  starting background worker process "parallel worker for PID 61194"
2022-08-22 17:13:30.717 BST [26496]  DEBUG:  starting background worker process "parallel worker for PID 61194"
2022-08-22 17:13:30.830 BST [26503]  DEBUG:  archived write-ahead log file "000000010000002600000051"
2022-08-22 17:13:30.880 BST [26503]  DEBUG:  archived write-ahead log file "000000010000002600000052"
2022-08-22 17:13:31.067 BST [26503]  DEBUG:  archived write-ahead log file "000000010000002600000053"

I think this is revealing. So I then ran a manual vacuum on the table (it reported removing 0 rows, which I would expect, since I had only just inserted the 30 million).
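For completeness, the vacuum was along these lines; a sketch, the exact options are not critical:

vacuum (verbose) bigt;
-- reports 0 removable dead row versions, but it still visits the pages,
-- setting hint bits and visibility-map bits as it goes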

I then repeated the EXPLAIN SELECT, and this time the plan shows that no buffers are "dirtied". I guess this is the biggest clue.

db=# explain (analyze,buffers,verbose) select count(*) from bigt;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=375124.06..375124.07 rows=1 width=8) (actual time=1030.392..1036.682 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=65150 read=215224
   I/O Timings: read=495.941
   ->  Gather  (cost=375124.00..375124.05 rows=4 width=8) (actual time=1030.270..1036.676 rows=5 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 4
         Workers Launched: 4
         Buffers: shared hit=65150 read=215224
         I/O Timings: read=495.941
         ->  Partial Aggregate  (cost=374124.00..374124.01 rows=1 width=8) (actual time=1010.259..1010.260 rows=1 loops=5)
               Output: PARTIAL count(*)
               Buffers: shared hit=65150 read=215224
               I/O Timings: read=495.941
               Worker 0:  actual time=1005.289..1005.289 rows=1 loops=1
                 Buffers: shared hit=12674 read=42481
                 I/O Timings: read=97.199
               Worker 1:  actual time=1005.322..1005.323 rows=1 loops=1
                 Buffers: shared hit=12833 read=43200
                 I/O Timings: read=99.123
               Worker 2:  actual time=1005.319..1005.320 rows=1 loops=1
                 Buffers: shared hit=12864 read=42989
                 I/O Timings: read=100.243
               Worker 3:  actual time=1005.289..1005.290 rows=1 loops=1
                 Buffers: shared hit=12793 read=43094
                 I/O Timings: read=98.914
               ->  Parallel Seq Scan on public.bigt  (cost=0.00..355374.00 rows=7500000 width=0) (actual time=0.140..613.439 rows=6000000 loops=5)
                     Output: a, b, des
                     Buffers: shared hit=65150 read=215224
                     I/O Timings: read=495.941
                     Worker 0:  actual time=0.204..614.154 rows=5901585 loops=1
                       Buffers: shared hit=12674 read=42481
                       I/O Timings: read=97.199
                     Worker 1:  actual time=0.171..613.932 rows=5995531 loops=1
                       Buffers: shared hit=12833 read=43200
                       I/O Timings: read=99.123
                     Worker 2:  actual time=0.132..613.952 rows=5976271 loops=1
                       Buffers: shared hit=12864 read=42989
                       I/O Timings: read=100.243
                     Worker 3:  actual time=0.179..611.561 rows=5979909 loops=1
                       Buffers: shared hit=12793 read=43094
                       I/O Timings: read=98.914
 Query Identifier: -3522295412005428879
 Planning:
   Buffers: shared hit=1 read=1
   I/O Timings: read=0.663
 Planning Time: 1.231 ms
 Execution Time: 1036.729 ms
(48 rows)

Time: 1040.606 ms (00:01.041)

It is rather strange, though, that the query still triggers all this WAL-archiving activity. It's reliably repeatable; I just didn't get why.
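One way to quantify the WAL a single query generates is to diff the WAL insert position around it; a sketch for psql using standard functions (on a busy cluster the difference also includes WAL from other sessions):

select pg_current_wal_insert_lsn() as before_lsn \gset
select count(*) from bigt;
select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_insert_lsn(), :'before_lsn')) as wal_generated;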

The SELECT is dirtying buffers, and this clue led me to read about the setting of the "visible to all" bit in each row header. I need to go read about that, as it sounds relevant. Thanks all for the help!
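The page-level counterpart of that flag, the visibility map, can be inspected directly with the pg_visibility contrib extension if it is installed; a sketch:

create extension if not exists pg_visibility;
-- number of pages of the table currently marked all-visible / all-frozen
select * from pg_visibility_map_summary('bigt');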

Non-default settings are: ... wal_log_hints=on,

Well, there is your answer to why you get archiving. Doing a SELECT on a dirty table will set hint bits, and with that setting, it generates WAL, and that WAL needs to be archived.

The SELECT doesn't wait for archival to happen, but by triggering archival it then has to compete with it for resources. That may not even be the main reason for the slowness, though: even if it didn't generate WAL, setting the hint bits still consumes both CPU and I/O.

There was a proposal to add a setting to limit how many hint bits a SELECT would be willing to set, but I don't think it was ever accepted. One side argued that the SELECT, having already done the work to figure out whether the tuple was visible to it, should set the hint bit so that a future SELECT doesn't have to repeat that determination. The other side argued that setting hint bits is better done by autovacuum (which doesn't have anyone waiting on it), so why should a SELECT annoy its user just to steal part of autovacuum's job?
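A practical aside that follows from this, a minimal sketch rather than a prescription: after a bulk load or a large delete, set the hint bits (and visibility map) up front yourself so that the first reader does not pay the cost:

vacuum bigt;
-- or, when loading by COPY into a table created or truncated in the same
-- transaction, COPY ... WITH (FREEZE) writes rows that are already frozen and hinted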
