
Postgres 11 Standby never catches up

Since upgrading to Postgres 11 I cannot get my production standby server to catch up. In the logs things look fine eventually:

2019-02-06 19:23:53.659 UTC [14021] LOG:  consistent recovery state reached at 3C772/8912C508
2019-02-06 19:23:53.660 UTC [13820] LOG:  database system is ready to accept read only connections
2019-02-06 19:23:53.680 UTC [24261] LOG:  started streaming WAL from primary at 3C772/8A000000 on timeline 1

But the following queries show everything is not fine:

warehouse=# SELECT coalesce(abs(pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())), -1) / 1024 / 1024 / 1024 AS replication_delay_gbytes;
 replication_delay_gbytes
-------------------------
    208.2317776754498486
(1 row)

warehouse=# select now() - pg_last_xact_replay_timestamp() AS replication_delay;
 replication_delay
-------------------
 01:54:19.150381
(1 row)

After a while (a couple of hours) replication_delay stays about the same, but replication_delay_gbytes keeps growing; note that replication_delay is behind from the beginning, while replication_delay_gbytes starts near 0. During startup there were a number of these messages:

2019-02-06 18:24:36.867 UTC [14036] WARNING:  xlog min recovery request 3C734/FA802AA8 is past current point 3C700/371ED080
2019-02-06 18:24:36.867 UTC [14036] CONTEXT:  writing block 0 of relation base/16436/2106308310_vm

but Googling suggests these are fine.

The replica was created using repmgr, which runs pg_basebackup to perform the clone; the replica then starts up and catches up on its own. This setup previously worked with Postgres 10.

Any thoughts on why this replica comes up but is perpetually lagging?

I'm still not sure what the issue is/was, but I was able to get the standby caught up with these two changes:

  • set use_replication_slots=true in the repmgr config
  • set wal_compression=on in the postgres config
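For reference, the two changes above look roughly like this (a sketch; file locations and surrounding settings vary by install):

# repmgr.conf
use_replication_slots=true

# postgresql.conf (on the primary)
wal_compression = on

Note that wal_compression compresses only full-page images written to WAL, so the bandwidth saving depends on how much of the WAL volume consists of full-page writes, and max_replication_slots on the primary must be high enough to accommodate the slot repmgr creates.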

Using replication slots didn't seem to change anything other than causing replication_delay_gbytes to stay roughly flat. Turning on WAL compression did help, somehow, although I'm not entirely sure how. In theory it makes it possible to ship WAL files to the standby faster, but reviewing network logs I see a drop in sent/received bytes that matches the effects of compression, so it seems to be shipping WAL files at the same speed, just using less network.
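To confirm the slot is actually in use and to watch the lag from the primary's side (rather than the standby's), queries along these lines help; both are run on the primary:

-- Confirm the standby's replication slot exists and is active
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;

-- Byte lag of each streaming standby relative to the primary's current WAL position
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

If active is false or the standby is missing from pg_stat_replication, WAL is not being streamed at all and the standby is falling back to some other recovery source.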

It still seems like there is some underlying issue at play here, though. For example, when pg_basebackup creates the standby it generates roughly 500 MB/s of network traffic, but once the standby finishes recovery and starts streaming WAL, throughput drops to ~250 MB/s without WAL compression and ~100 MB/s with it. Yet there was no decrease in network traffic after the standby caught up with WAL compression enabled, so I'm not sure what allowed it to catch up.

