Background:
RAC environment, DB version: 11.2.0.4
Because the company was moving server racks, the RAC nodes had to be shut down temporarily. The plan: stop the RAC stack on node 1 and power the machine off; after its rack was moved, bring node 1's RAC stack back up; then stop the RAC stack on node 2 and power it off; after its move, bring node 2's RAC stack back up.
What actually happened: node 1 was moved and its services came back up successfully, so we stopped node 2's RAC stack and powered it off. With node 2 down, users could keep working on node 1. But less than 5 minutes after node 2 was powered off, node 1 shut down its own RAC services and rebooted itself (at roughly 10:15 on 2016-01-20). The rack-moving crew insisted they had not touched any of node 1's cables and had no idea why this happened. Once node 1 came back up, we started hunting for the cause.
My first guess was a voting-disk eviction: the voting disk kicks node 1 out of the cluster and I/O fencing reboots the server. But the logs show the voting disk was not the direct cause. Enough talk; let's look at the logs.
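The eviction hypothesis can be screened quickly by filtering the grid alert log for CSSD membership messages. A minimal sketch, using lines copied from the rac1 alert log excerpt below as sample input; on a live node you would grep the alert log under $GRID_HOME/log/rac1/ directly, and treating CRS-1607/1608/1610 as the usual eviction codes is my assumption here, worth checking against the Oracle error reference:

```shell
#!/bin/sh
# Screen a Grid Infrastructure alert log for CSSD membership events.
# Sample input: lines copied from the rac1 alert log shown in this post.
cat > /tmp/alert-rac1.sample <<'EOF'
2016-01-20 10:08:40.775:
[cssd(11233)]CRS-1625:Node rac2, number 2, was manually shut down
2016-01-20 10:08:40.784:
[cssd(11233)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 .
2016-01-20 10:15:53.300:
[cssd(11233)]CRS-1603:CSSD on node rac1 shutdown by user.
EOF
# CRS-1607/1608/1610 would indicate an eviction; what we actually find is a
# manual shutdown of rac2 (CRS-1625) and a user-driven CSSD stop (CRS-1603).
grep -E 'CRS-1607|CRS-1608|CRS-1610|CRS-1625|CRS-1601|CRS-1603' /tmp/alert-rac1.sample
```

No eviction codes show up in the real log, which is what pointed us away from the voting disk.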
Node 1 grid alert-rac1.log
2016-01-20 09:38:39.292:
[ctssd(11368)]CRS-2408:The clock on host rac1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2016-01-20 09:38:57.565:
[crsd(11498)]CRS-1012:The OCR service started on node rac1.
2016-01-20 09:38:57.580:
[evmd(11389)]CRS-1401:EVMD started on node rac1.
2016-01-20 09:38:58.721:
[crsd(11498)]CRS-1201:CRSD started on node rac1.
2016-01-20 10:08:07.231:
[client(16966)]CRS-4743:File /u01/11.2.0/grid/oc4j/j2ee/home/OC4J_DBWLM_config/system-jazn-data.xml was updated from OCR(Size: 13365(New), 13378(Old) bytes)
2016-01-20 10:08:37.992:
[ctssd(11368)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac1.
2016-01-20 10:08:40.775:
[cssd(11233)]CRS-1625:Node rac2, number 2, was manually shut down
2016-01-20 10:08:40.784:
[cssd(11233)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 .
2016-01-20 10:08:40.786:
[crsd(11498)]CRS-5504:Node down event reported for node 'rac2'.
2016-01-20 10:08:40.787:
[crsd(11498)]CRS-2773:Server 'rac2' has been removed from pool 'Generic'.
2016-01-20 10:08:40.787:
[crsd(11498)]CRS-2773:Server 'rac2' has been removed from pool 'ora.msr'.
2016-01-20 10:15:42.490:
[/u01/11.2.0/grid/bin/orarootagent.bin(11616)]CRS-5822:Agent '/u01/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:9} in /u01/11.2.0/grid/log/rac1/agent/crsd/orarootagent_root/orarootagent_root.log.
2016-01-20 10:15:43.416:
[mdnsd(11155)]CRS-5602:mDNS service stopping by request.
2016-01-20 10:15:43.419:
[ctssd(11368)]CRS-2405:The Cluster Time Synchronization Service on host rac1 is shutdown by user
2016-01-20 10:15:53.300:
[cssd(11233)]CRS-1603:CSSD on node rac1 shutdown by user.
2016-01-20 10:15:53.406:
[ohasd(10794)]CRS-2767:Resource state recovery not attempted for 'ora.cssdmonitor' as its target state is OFFLINE
2016-01-20 10:15:53.406:
[ohasd(10794)]CRS-2769:Unable to failover resource 'ora.cssdmonitor'.
2016-01-20 10:15:55.653:
[gpnpd(11165)]CRS-2329:GPNPD on node rac1 shutdown.
2016-01-20 10:21:06.907:
[ohasd(10802)]CRS-2112:The OLR service started on node rac1.
2016-01-20 10:21:07.000:
[ohasd(10802)]CRS-1301:Oracle High Availability Service started on node rac1.
2016-01-20 10:21:07.037:
[ohasd(10802)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2016-01-20 10:21:11.030:
[/u01/11.2.0/grid/bin/orarootagent.bin(10937)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2016-01-20 10:21:15.360:
[gpnpd(11195)]CRS-2328:GPNPD started on node rac1.
2016-01-20 10:21:17.731:
[cssd(11270)]CRS-1713:CSSD daemon is started in clustered mode
2016-01-20 10:21:19.543:
[ohasd(10802)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2016-01-20 10:21:19.551:
[ohasd(10802)]CRS-2769:Unable to failover resource 'ora.diskmon'.
2016-01-20 10:21:37.425:
[cssd(11270)]CRS-1707:Lease acquisition for node rac1 number 1 completed
2016-01-20 10:21:38.766:
[cssd(11270)]CRS-1605:CSSD voting file is online: ORCL:OCR_VOTE03; details in /u01/11.2.0/grid/log/rac1/cssd/ocssd.log.
2016-01-20 10:21:38.773:
[cssd(11270)]CRS-1605:CSSD voting file is online: ORCL:OCR_VOTE02; details in /u01/11.2.0/grid/log/rac1/cssd/ocssd.log.
2016-01-20 10:21:38.784:
[cssd(11270)]CRS-1605:CSSD voting file is online: ORCL:OCR_VOTE01; details in /u01/11.2.0/grid/log/rac1/cssd/ocssd.log.
2016-01-20 10:21:47.897:
[cssd(11270)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 .
2016-01-20 10:21:49.968:
[ctssd(11535)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac1.
......
As the log shows, node 1's RAC services began shutting down at 2016-01-20 10:15:43, yet nobody issued that shutdown manually.
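The alert log stores each event as a timestamp line followed by a message line, which makes eyeballing a shutdown sequence awkward. A small awk join, fed here with two records copied from the excerpt above, collapses each pair onto one timeline row:

```shell
#!/bin/sh
# Join the two-line records of a GI alert log ("timestamp:" then message)
# into single timeline rows. Sample records copied from the rac1 log above.
cat > /tmp/alert-rac1.pair <<'EOF'
2016-01-20 10:15:42.490:
[/u01/11.2.0/grid/bin/orarootagent.bin(11616)]CRS-5822:Agent '/u01/11.2.0/grid/bin/orarootagent_root' disconnected from server.
2016-01-20 10:15:43.416:
[mdnsd(11155)]CRS-5602:mDNS service stopping by request.
EOF
# Remember each timestamp line, then print it in front of its message line.
awk '/^20..-..-.. /{ts=$0; next} {print ts, $0}' /tmp/alert-rac1.pair
```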
Node 1 DB instance log
Reconfiguration complete
Wed Jan 20 10:08:27 2016
Decreasing number of real time LMS from 2 to 0
Wed Jan 20 10:12:13 2016
SKGXP: ospid 11859: network interface with IP address ** no longer running (check cable)
Wed Jan 20 10:15:31 2016
Shutting down instance (abort)
License high water mark = 83
USER (ospid: 19115): terminating the instance
Instance terminated by USER, pid = 19115
Wed Jan 20 10:15:33 2016
Instance shutdown complete
Wed Jan 20 10:22:28 2016
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 4
Number of processor cores in the system is 4
Number of processor sockets in the system is 1
Private Interface 'bond1:1' configured from GPnP for use as a private interconnect.
[name='bond1:1', type=1, ip=**, mac=**, net=**, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'bond0' configured from GPnP for use as a public interface.
[name='bond0', type=1, ip=**, mac=**, net=**, mask=255.255.255.224, use=public/1]
Public Interface 'bond0:1' configured from GPnP for use as a public interface.
[name='bond0:1', type=1, ip=**, mac=**, net=**, mask=255.255.255.224, use=public/1]
Public Interface 'bond0:2' configured from GPnP for use as a public interface.
[name='bond0:2', type=1, ip=**, mac=**, net=**, mask=255.255.255.224, use=public/1]
Public Interface 'bond0:3' configured from GPnP for use as a public interface.
[name='bond0:3', type=1, ip=**, mac=**, net=**, mask=255.255.255.224, use=public/1]
Public Interface 'bond0:4' configured from GPnP for use as a public interface.
[name='bond0:4', type=1, ip=**, mac=**, net=**, mask=255.255.255.224, use=public/1]
Public Interface 'bond0:5' configured from GPnP for use as a public interface.
[name='bond0:5', type=1, ip=**, mac=**, net=**, mask=255.255.255.224, use=public/1]
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
NUMA status: non-NUMA system
cellaffinity.ora status: N/A
CELL communication will use 1 IP group(s):
Grp 0:
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
WARNING: db_recovery_file_dest is same as db_create_online_log_dest_2
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters option.
ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
System name: Linux
......
At 10:15:31 on 2016-01-20, the node 1 DB instance was shut down (abort). Did someone really shut the instance down and reboot the box by hand? We checked: during that window only my own machine's IP had logged into node 1, and I had run nothing on node 1 at that moment. Next, node 1's ASM log.
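The login check itself is easy to script. A sketch of the kind of filter used to see which source addresses opened SSH sessions on node 1; the two sample lines are illustrative only (the user names, PIDs, and the 10.0.0.99 address are invented for the example), and on the real host you would read /var/log/secure and cross-check with `last -F`:

```shell
#!/bin/sh
# List SSH logins (time, user, source IP) from an sshd auth log.
# The sample lines below are hypothetical, for illustration only.
cat > /tmp/secure.sample <<'EOF'
Jan 20 10:05:12 rac1 sshd[18802]: Accepted password for oracle from 10.0.0.99 port 50412 ssh2
Jan 20 10:30:44 rac1 sshd[20111]: Accepted password for grid from 10.0.0.99 port 50587 ssh2
EOF
# In the standard "Accepted ... for USER from ADDR" line, $9 is the user
# and $11 is the source address.
awk '/sshd/ && /Accepted/ {print $1, $2, $3, $9, $11}' /tmp/secure.sample | sort -u
```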
Node 1 +ASM1 instance log
Reconfiguration complete
Wed Jan 20 10:12:35 2016
SKGXP: ospid 11426: network interface with IP address ** no longer running (check cable)
Wed Jan 20 10:15:31 2016
SQL> ALTER DISKGROUP OCR_VOTE DISMOUNT /* asm agent *//* {1:57967:201} */
Wed Jan 20 10:15:33 2016
SQL> ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
Wed Jan 20 10:15:33 2016
NOTE: ASM client RACDB1:RACDB disconnected unexpectedly.
NOTE: check client alert log.
Wed Jan 20 10:15:33 2016
SQL> ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_11891.trc
NOTE: cache dismounting (clean) group 2/0x82B9443F (DBFRA)
NOTE: cache dismounting (clean) group 1/0x82B9443E (DBDATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 11666, image: oracle@rac1 (TNS V1-V3)
NOTE: messaging CKPT to quiesce pins Unix process pid: 11668, image: oracle@rac1 (TNS V1-V3)
Wed Jan 20 10:15:33 2016
NOTE: LGWR doing clean dismount of group 2 (DBFRA)
NOTE: LGWR closing thread 1 of diskgroup 2 (DBFRA) at ABA 27.10488
NOTE: LGWR released thread recovery enqueue
NOTE: LGWR doing clean dismount of group 1 (DBDATA)
NOTE: LGWR closing thread 1 of diskgroup 1 (DBDATA) at ABA 16.3387
NOTE: LGWR released thread recovery enqueue
Wed Jan 20 10:15:33 2016
freeing rdom 2
Wed Jan 20 10:15:34 2016
NOTE: detached from domain 2
NOTE: cache dismounted group 2/0x82B9443F (DBFRA)
Wed Jan 20 10:15:34 2016
GMON dismounting group 2 at 10 for pid 30, osid 11666
NOTE: Disk ASM_FRA01 in mode 0x7f marked for de-assignment
SUCCESS: diskgroup DBFRA was dismounted
NOTE: cache deleting context for group DBFRA 2/0x82b9443f
SUCCESS: ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
SQL> ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DBFRA" does not exist or is not mounted
ERROR: ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
SQL> ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DBFRA" does not exist or is not mounted
ERROR: ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
SQL> ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DBFRA" does not exist or is not mounted
ERROR: ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
SQL> ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DBFRA" does not exist or is not mounted
ERROR: ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
SQL> ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DBFRA" does not exist or is not mounted
ERROR: ALTER DISKGROUP DBFRA DISMOUNT /* asm agent *//* {1:57967:201} */
freeing rdom 1
NOTE: detached from domain 1
NOTE: cache dismounted group 1/0x82B9443E (DBDATA)
GMON dismounting group 1 at 11 for pid 31, osid 11668
NOTE: Disk ASM_DATA01 in mode 0x7f marked for de-assignment
SUCCESS: diskgroup DBDATA was dismounted
NOTE: cache deleting context for group DBDATA 1/0x82b9443e
SUCCESS: ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
SQL> ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DBDATA" does not exist or is not mounted
ERROR: ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
SQL> ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DBDATA" does not exist or is not mounted
ERROR: ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
SQL> ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DBDATA" does not exist or is not mounted
ERROR: ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
SQL> ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DBDATA" does not exist or is not mounted
ERROR: ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
SQL> ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DBDATA" does not exist or is not mounted
ERROR: ALTER DISKGROUP DBDATA DISMOUNT /* asm agent *//* {1:57967:201} */
Wed Jan 20 10:15:41 2016
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 3
Wed Jan 20 10:15:41 2016
ORA-15032: not all alterations performed
ORA-15027: active use of diskgroup "OCR_VOTE" precludes its dismount
ERROR: ALTER DISKGROUP OCR_VOTE DISMOUNT /* asm agent *//* {1:57967:201} */
Wed Jan 20 10:15:42 2016
NOTE: client exited [11498]
Wed Jan 20 10:15:43 2016
NOTE: ASMB process exiting due ASM instance shutdown (inactive for 1 seconds)
Wed Jan 20 10:15:43 2016
Shutting down instance (immediate)
Shutting down instance: further logons disabled
Stopping background process MMNL
Stopping background process MMON
License high water mark = 12
SQL> ALTER DISKGROUP ALL DISMOUNT /* asm agent *//* {0:0:169} */
NOTE: cache dismounting (clean) group 3/0x82C94440 (OCR_VOTE)
NOTE: messaging CKPT to quiesce pins Unix process pid: 11495, image: oracle@rac1 (TNS V1-V3)
Wed Jan 20 10:15:45 2016
NOTE: LGWR doing clean dismount of group 3 (OCR_VOTE)
NOTE: LGWR closing thread 1 of diskgroup 3 (OCR_VOTE) at ABA 9.73
NOTE: LGWR released thread recovery enqueue
Wed Jan 20 10:15:45 2016
freeing rdom 3
Wed Jan 20 10:15:45 2016
NOTE: detached from domain 3
NOTE: cache dismounted group 3/0x82C94440 (OCR_VOTE)
Wed Jan 20 10:15:45 2016
GMON dismounting group 3 at 13 for pid 23, osid 11495
NOTE: Disk OCR_VOTE01 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_VOTE02 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_VOTE03 in mode 0x7f marked for de-assignment
SUCCESS: diskgroup OCR_VOTE was dismounted
NOTE: cache deleting context for group OCR_VOTE 3/0x82c94440
SUCCESS: ALTER DISKGROUP ALL DISMOUNT /* asm agent *//* {0:0:169} */
Wed Jan 20 10:15:46 2016
Stopping background process VKTM
Wed Jan 20 10:15:48 2016
freeing rdom 0
Wed Jan 20 10:15:52 2016
Instance shutdown complete
......
At 10:15:33 on 2016-01-20, the ASM instance on node 1 reported "disconnected unexpectedly". Could a shared-storage problem have forced the node to restart? To find out, we turn to an OS-level log.
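Before reading the raw messages file, it helps to pull out just the multipathd events, since they show when the OS tore down the device maps for the shared LUNs. A sketch fed with lines copied from the node 1 messages excerpt below; on the live system you would also run `multipath -ll` to inspect current path states:

```shell
#!/bin/sh
# Extract the device names touched by multipathd from /var/log/messages.
# Sample lines copied from the node 1 messages log shown in this post.
cat > /tmp/messages.sample <<'EOF'
Jan 20 10:16:02 rac1 multipathd: dm-7: remove map (uevent)
Jan 20 10:16:02 rac1 multipathd: dm-7: devmap not registered, can't remove
Jan 20 10:16:02 rac1 multipathd: MEG1_OracleOS_Lun0: event checker exit
EOF
# Split on "multipathd: ", keep the token before the next colon (the device).
awk -F 'multipathd: ' 'NF > 1 {sub(/:.*/, "", $2); print $2}' /tmp/messages.sample | sort -u
```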
Node 1 /var/log/messages
Jan 20 10:10:39 rac1 kernel: pcieport 0000:00:02.0: eth2: Link is down
Jan 20 10:10:40 rac1 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it
Jan 20 10:10:40 rac1 kernel: bonding: bond1: making interface eth3 the new active one.
Jan 20 10:10:40 rac1 kernel: pcieport 0000:00:02.0: eth3: Link is down
Jan 20 10:10:40 rac1 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Jan 20 10:10:40 rac1 kernel: bonding: bond1: now running without any active interface !
Jan 20 10:10:42 rac1 kernel: pcieport 0000:00:02.0: eth2: Link is up at 1000 Mbps, full duplex
Jan 20 10:10:42 rac1 kernel: pcieport 0000:00:02.0: eth2: Flow control is on for TX and on for RX
Jan 20 10:10:42 rac1 kernel: pcieport 0000:00:02.0: eth2: EEE is disabled
Jan 20 10:10:43 rac1 kernel: bonding: bond1: link status definitely up for interface eth2, 1000 Mbps full duplex.
Jan 20 10:10:43 rac1 kernel: bonding: bond1: making interface eth2 the new active one.
Jan 20 10:10:43 rac1 kernel: bonding: bond1: first active interface up!
Jan 20 10:10:43 rac1 kernel: pcieport 0000:00:02.0: eth3: Link is up at 1000 Mbps, full duplex
Jan 20 10:10:43 rac1 kernel: pcieport 0000:00:02.0: eth3: Flow control is on for TX and on for RX
Jan 20 10:10:43 rac1 kernel: pcieport 0000:00:02.0: eth3: EEE is disabled
Jan 20 10:10:43 rac1 kernel: bonding: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex.
Jan 20 10:10:45 rac1 kernel: pcieport 0000:00:02.0: eth3: Link is down
Jan 20 10:10:46 rac1 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Jan 20 10:10:46 rac1 kernel: pcieport 0000:00:02.0: eth2: Link is down
Jan 20 10:10:46 rac1 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it
Jan 20 10:10:46 rac1 kernel: bonding: bond1: now running without any active interface !
Jan 20 10:10:48 rac1 kernel: pcieport 0000:00:02.0: eth3: Link is up at 100 Mbps, full duplex
Jan 20 10:10:48 rac1 kernel: pcieport 0000:00:02.0: eth3: Flow control is on for TX and on for RX
Jan 20 10:10:48 rac1 kernel: pcieport 0000:00:02.0: eth3: EEE is disabled
Jan 20 10:10:48 rac1 kernel: bonding: bond1: link status definitely up for interface eth3, 100 Mbps full duplex.
Jan 20 10:10:48 rac1 kernel: bonding: bond1: making interface eth3 the new active one.
Jan 20 10:10:48 rac1 kernel: bonding: bond1: first active interface up!
Jan 20 10:10:48 rac1 kernel: pcieport 0000:00:02.0: eth2: Link is up at 100 Mbps, full duplex
Jan 20 10:10:48 rac1 kernel: pcieport 0000:00:02.0: eth2: Flow control is on for TX and on for RX
Jan 20 10:10:48 rac1 kernel: pcieport 0000:00:02.0: eth2: EEE is disabled
Jan 20 10:10:48 rac1 kernel: bonding: bond1: link status definitely up for interface eth2, 100 Mbps full duplex.
Jan 20 10:10:48 rac1 kernel: bonding: bond1: making interface eth2 the new active one.
Jan 20 10:11:39 rac1 kernel: pcieport 0000:00:02.0: eth3: Link is down
Jan 20 10:11:39 rac1 kernel: pcieport 0000:00:02.0: eth2: Link is down
Jan 20 10:11:39 rac1 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it
Jan 20 10:11:39 rac1 kernel: bonding: bond1: now running without any active interface !
Jan 20 10:11:39 rac1 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Jan 20 10:15:31 rac1 init: oracle-tfa main process (9074) killed by TERM signal
Jan 20 10:15:31 rac1 init: oracle-ohasd main process (9075) killed by TERM signal
Jan 20 10:15:31 rac1 init: tty (/dev/tty2) main process (10827) killed by TERM signal
Jan 20 10:15:31 rac1 init: tty (/dev/tty3) main process (10829) killed by TERM signal
Jan 20 10:15:31 rac1 init: tty (/dev/tty4) main process (10831) killed by TERM signal
Jan 20 10:15:31 rac1 init: tty (/dev/tty5) main process (10833) killed by TERM signal
Jan 20 10:15:31 rac1 init: tty (/dev/tty6) main process (10835) killed by TERM signal
Jan 20 10:15:58 rac1 abrtd: Got signal 15, exiting
Jan 20 10:16:01 rac1 acpid: exiting
Jan 20 10:16:02 rac1 multipathd: dm-7: remove map (uevent)
Jan 20 10:16:02 rac1 multipathd: dm-7: devmap not registered, can't remove
Jan 20 10:16:02 rac1 multipathd: dm-7: remove map (uevent)
Jan 20 10:16:02 rac1 multipathd: dm-7: devmap not registered, can't remove
Jan 20 10:16:02 rac1 multipathd: MEG1_OracleOS_Lun0: event checker exit
Jan 20 10:16:02 rac1 multipathd: dm-2: remove map (uevent)
Jan 20 10:16:02 rac1 multipathd: dm-2: devmap not registered, can't remove
......
Now the problem is clear: the fiber and network links to the switch and the shared storage were disturbed.
Most likely someone moving the racks accidentally kicked or snagged the cables and loosened them; node 1 then lost its connection to the shared storage, and the clusterware shut down its resources and rebooted the node.
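The link-flap evidence can be quantified with a one-liner: count how many times bond1 lost every active slave. Sample lines copied from the messages excerpt above; on a running node, `cat /proc/net/bonding/bond1` shows the current slave states:

```shell
#!/bin/sh
# Count how many times the bond lost all active slaves, i.e. how many times
# the private interconnect went fully dark. Lines copied from the post.
cat > /tmp/bond.sample <<'EOF'
Jan 20 10:10:40 rac1 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it
Jan 20 10:10:40 rac1 kernel: bonding: bond1: now running without any active interface !
Jan 20 10:10:43 rac1 kernel: bonding: bond1: link status definitely up for interface eth2, 1000 Mbps full duplex.
Jan 20 10:11:39 rac1 kernel: bonding: bond1: now running without any active interface !
EOF
grep -c 'without any active interface' /tmp/bond.sample
```

In the full log, bond1 went completely dark more than once within a couple of minutes, consistent with cables being disturbed during the move.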
Source: ITPUB blog, http://blog.itpub/30220976/viewspace-1979652/. Please credit the source when reposting.