I am running a streaming replication environment with PostgreSQL 9.1 (1 master, 3 slaves). Everything worked fine for approx. 2 months. Yesterday, replication to one of the slaves failed, and the log on that slave showed:
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
FATAL: terminating walreceiver process due to administrator command
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
The slave was no longer in sync with the master. After two hours, during which the log gained a new line like the above every 5 seconds, I restarted the slave database server:
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
LOG: shutting down
LOG: database system is shut down
The new log file on the slave contains:
LOG: database system was shut down in recovery at 2016-02-29 05:12:11 CET
LOG: entering standby mode
LOG: redo starts at 61/D92C10C9
LOG: consistent recovery state reached at 61/DA2710A7
LOG: database system is ready to accept read only connections
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: streaming replication successfully connected to primary
Now the slave is in sync with the master again, but the checksum entry still appears in the log. I also checked the network logs: the network was available the whole time.
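For anyone who wants to notice this pattern sooner than after two hours, here is a minimal Python sketch that scans standby log lines for the repeated checksum message and reports WAL locations that show up several times. It is only an illustration: the function names, the `SAMPLE_LOG` data and the threshold of 3 are my own assumptions, and it relies on the message text looking exactly like the excerpts above.

```python
import re
from collections import Counter

# Message format as seen in the log excerpts above; the WAL location
# is written as two hexadecimal numbers separated by a slash.
CHECKSUM_RE = re.compile(
    r"incorrect resource manager data checksum in record at "
    r"([0-9A-Fa-f]+/[0-9A-Fa-f]+)"
)

# Hypothetical sample, shortened from the log shown in the question.
SAMPLE_LOG = [
    "LOG: incorrect resource manager data checksum in record at 61/DA2710A7",
    "FATAL: terminating walreceiver process due to administrator command",
    "LOG: incorrect resource manager data checksum in record at 61/DA2710A7",
    "LOG: incorrect resource manager data checksum in record at 61/DA2710A7",
]


def count_checksum_errors(log_lines):
    """Return a Counter mapping WAL location -> number of checksum errors."""
    counts = Counter()
    for line in log_lines:
        match = CHECKSUM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def stuck_locations(log_lines, threshold=3):
    """Return WAL locations reported at least `threshold` times.

    The threshold is an arbitrary assumption, not a PostgreSQL setting.
    """
    counts = count_checksum_errors(log_lines)
    return sorted(loc for loc, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    for location in stuck_locations(SAMPLE_LOG):
        print("standby appears stuck at WAL location", location)
```

Something like this could be run periodically against the standby's log file, for example from cron. It only detects the symptom and does not explain or fix the underlying cause.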
My questions are:
Does anyone know why the walreceiver was terminated?
Why didn't PostgreSQL retry the replication?
What can I do to prevent this in the future?
Thank you.
EDIT:
The database servers are running on SLES 11 with ext3. I found an article about low performance of SLES 11 with large RAM but I am not sure if it applies since my machine has only 8 GB RAM (https://www.novell.com/support/kb/doc.php?id=7010287)
Any help would be appreciated.
EDIT (2):
The PostgreSQL version is 9.1.5. It seems that PostgreSQL 9.1.6 provides a fix for a similar issue:
Fix persistence marking of shared buffers during WAL replay (Jeff Davis)
This mistake can result in buffers not being written out during checkpoints, resulting in data corruption if the server later crashes without ever having written those buffers. Corruption can occur on any server following crash recovery, but it is significantly more likely to occur on standby slave servers since those perform much more WAL replay.
Source: http://www.postgresql.org/docs/9.1/static/release-9-1-6.html
Might this be the fix? If I upgrade to PostgreSQL 9.1.6, will everything run smoothly?
Best answer
In case someone stumbles across this question: I ended up reinstalling the databases from backed-up data and setting up replication again. I never really figured out what went wrong.