
Postgres HA (based on WAL-shipping) fails

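I'm hoping someone can help me with a WAL-shipping and standby problem. My standby system ran happily for weeks, then suddenly started looking for .history files that do not exist. It then fell over, and I cannot restart it successfully without rebuilding the standby database.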

Both systems run CentOS 4.5 and Postgres 8.4.1. They use NFS to store the WAL files from production on the standby.

Relevant portions of the log, with my comments:

[** Recovery is running normally **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001000000830000005B
WAL file path           : /var/tafkan_backup_from_db1/00000001000000830000005B
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001000000830000005B" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 00000001000000830000004D and later
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
running restore         : OK

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001000000830000005B
WAL file path           : /var/tafkan_backup_from_db1/00000001000000830000005B
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001000000830000005B" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 000000000000000000000000 and later
running restore         : OK

[** All of a sudden it starts looking for .history files **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000002.history
WAL file path           : /var/tafkan_backup_from_db1/00000002.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000002.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
not restored
history file not found
Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001.history
WAL file path           : /var/tafkan_backup_from_db1/00000001.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
not restored
history file not found

[** I stopped Postgres, renamed recovery.done to recovery.conf, and restarted it. **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000002.history
WAL file path           : /var/tafkan_backup_from_db1/00000002.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000002.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
not restored
history file not found
Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 0000000200000083000000A2
WAL file path           : /var/tafkan_backup_from_db1/0000000200000083000000A2
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/0000000200000083000000A2" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 000000000000000000000000 and later
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...

[** This file is not present. All WAL files start with 00000001. **] 

Any ideas? I don't even know what a .history file is, and the documentation is (mostly) not very clear about it.

PS. I wish I were running VMs, so that I could use link text and not have to worry about any of this application-level HA nonsense :-)

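Update: Here are some logs from the standby server at about this time. It looks like something caused the server to stop recovery and come online, but I don't know what. I'm fairly sure nothing could have created the trigger file.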

2010-01-20 03:30:15 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005A" from archive
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005B" from archive
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  record with zero length at 83/5BFA2FF8
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  redo done at 83/5BFA2FAC
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  last completed transaction was at log time 2010-01-20 03:28:04.594399-05
2010-01-20 03:30:25 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005B" from archive
2010-01-20 03:30:37 EST 4b3a5c63.401b LOG:  selected new timeline ID: 2
2010-01-20 03:30:49 EST 4b3a5c63.401b LOG:  archive recovery complete
2010-01-20 03:30:59 EST 4b3a5c62.4019 LOG:  database system is ready to accept connections

A completely different approach to HA would be to host the PG database on a DRBD device shared between the two machines.

Are you using your own recovery script/program? If so, don't. Use pg_standby from PostgreSQL contrib.

Otherwise, just ignore the .history files.

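Your replica was brought online at some point. The request for "00000002.history" means it is looking for the history file of timeline 00000002, whereas the rest of your WAL files start with 00000001 (the original timeline).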
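To make the timeline point concrete, here is a minimal Python sketch that reads the timeline ID out of archive file names like the ones in the question. It relies on the standard naming layout (24 hex digits for a WAL segment, of which the first 8 are the timeline ID; 8 hex digits plus ".history" for a timeline history file). The function name `timeline_of` is made up for this illustration and is not part of PostgreSQL.

```python
import re

# Layout of names in a PostgreSQL WAL archive: a segment is 24 hex digits
# (8 for the timeline ID, 8 for the log number, 8 for the segment number);
# a timeline history file is 8 hex digits followed by ".history".
SEGMENT_RE = re.compile(r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})$")
HISTORY_RE = re.compile(r"^([0-9A-F]{8})\.history$")


def timeline_of(name):
    """Return the timeline ID encoded in an archive file name, or None."""
    match = SEGMENT_RE.match(name) or HISTORY_RE.match(name)
    return int(match.group(1), 16) if match else None


if __name__ == "__main__":
    for name in (
        "00000001000000830000005B",  # last segment restored normally
        "00000002.history",          # first history file requested
        "0000000200000083000000A2",  # segment requested after the restart
    ):
        print(name, "-> timeline", timeline_of(name))
```

Run against the names from the question, this reports timeline 1 for the last segment that was restored normally and timeline 2 for both the history file and the segment requested after the restart, which is why the standby waits for a file the primary never produces.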

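I would check the PostgreSQL log from just before it started looking for history files, to see whether there is any indication that the database came online, even for a short time.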
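One way to do that check is sketched below in Python, assuming the server log is plain text with one message per line. The three marker strings are copied from the standby log quoted in the question; the function name `find_promotion_lines` is hypothetical, and other PostgreSQL versions may word these messages differently.

```python
# Messages taken from the standby log quoted in the question; each one
# indicates that the server left recovery and came up on its own.
PROMOTION_MARKERS = (
    "selected new timeline ID",
    "archive recovery complete",
    "database system is ready to accept connections",
)


def find_promotion_lines(lines):
    """Return (line_number, text) pairs for lines that signal a promotion."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in PROMOTION_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    sample = [
        '2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  redo done at 83/5BFA2FAC\n',
        '2010-01-20 03:30:37 EST 4b3a5c63.401b LOG:  selected new timeline ID: 2\n',
        '2010-01-20 03:30:49 EST 4b3a5c63.401b LOG:  archive recovery complete\n',
    ]
    for number, text in find_promotion_lines(sample):
        print(f"{number}: {text}")
```

To scan a real log, pass an open file object instead of the sample list, since iterating over a file yields its lines. Any hit means the standby ended recovery at that point, and the timestamps on those lines show when it happened.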

I was able to resolve this by updating the CentOS operating system on both PostgreSQL servers. So I think it was a symptom of some underlying network error.
