In this post we will terminate the private network interconnect between the two nodes by moving the LAN interfaces of node “rac1” to a different VLAN. As a result the network ports remain physically connected (link stays up), but connectivity between the nodes is interrupted.
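If you want to reproduce a similar failure in a lab without access to the switch configuration, the same symptom (link up, but no traffic) can be approximated on the node itself by dropping all traffic on the private interface. This is only a sketch and not the method used in this test; it assumes the private interconnect is eth1 and must be run as root:

# Assumption: eth1 is the private interconnect interface on this node.
# Verify which interface the clusterware actually uses first:
oifcfg getif

# Block all traffic on eth1 without taking the link down. The port stays "up",
# but the nodes can no longer exchange network heartbeats.
iptables -A INPUT  -i eth1 -j DROP
iptables -A OUTPUT -o eth1 -j DROP

# To restore connectivity later, remove the rules again:
# iptables -D INPUT  -i eth1 -j DROP
# iptables -D OUTPUT -o eth1 -j DROP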
Terminating the network connection
After terminating the network connection between the two nodes, node “rac2” rebooted to preserve cluster integrity. The log files on both nodes document this clearly:
Alertrac1.log
2009-09-30 10:20:12.903 [cssd(4178)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2009-09-30 10:21:14.920 [crsd(4454)]CRS-2772:Server 'rac2' has been assigned to pool 'Generic'.
2009-09-30 10:21:14.921 [crsd(4454)]CRS-2772:Server 'rac2' has been assigned to pool 'ora.ora11p'.
2009-09-30 11:42:01.860 [cssd(4178)]CRS-1612:Network communication with node rac2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.170 seconds
2009-09-30 11:42:08.930 [cssd(4178)]CRS-1611:Network communication with node rac2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.100 seconds
2009-09-30 11:42:13.977 [cssd(4178)]CRS-1610:Network communication with node rac2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.050 seconds
2009-09-30 11:42:16.044 [cssd(4178)]CRS-1607:Node rac2 is being evicted in cluster incarnation 148255196; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/rac1/cssd/ocssd.log.
2009-09-30 11:42:46.735 [cssd(4178)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 .
2009-09-30 11:42:46.763 [crsd(4454)]CRS-5504:Node down event reported for node 'rac2'.
2009-09-30 11:43:04.524 [ohasd(3698)]CRS-8011:reboot advisory message from host: rac2, component: mo102000, with time stamp: L-2009-09-30-11:42:33.374
[ohasd(3698)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2009-09-30 11:43:09.029 [crsd(4454)]CRS-2773:Server 'rac2' has been removed from pool 'Generic'.
2009-09-30 11:43:09.029 [crsd(4454)]CRS-2773:Server 'rac2' has been removed from pool 'ora.ora11p'.
Alertrac2.log
2009-09-30 11:42:01.822 [cssd(3914)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.120 seconds
2009-09-30 11:42:08.891 [cssd(3914)]CRS-1611:Network communication with node rac1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 7.050 seconds
2009-09-30 11:42:13.941 [cssd(3914)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.000 seconds
2009-09-30 11:42:15.956 [cssd(3914)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
2009-09-30 11:42:16.432 [cssd(3914)]CRS-1608:This node was evicted by node 1, rac1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
2009-09-30 11:42:16.771 [cssd(3914)]CRS-1608:This node was evicted by node 1, rac1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
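The eviction countdown in both logs is driven by the CSS “misscount” setting, which defaults to 30 seconds in 11.2 (without vendor clusterware). The 50%, 75% and 90% warnings therefore correspond to roughly 15, 7.5 and 3 seconds remaining, which matches the values logged above. You can check the value configured in your own cluster (as the grid owner or root):

# CSS misscount: maximum time in seconds the network heartbeat may be missing
# before a node is evicted (default 30 in 11.2 without vendor clusterware).
crsctl get css misscount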
Starting up the system with private network communication still impossible
Node “rac2” was forced to restart to preserve cluster integrity. Note that communication between the two nodes is still impossible (the private network interfaces of both nodes are in separate VLANs).
While the surviving node “rac1” does not report any problems in the clusterware alert log or in the cssd.log file, the rebooted node “rac2” reports a lot:
Alertrac2.log on node “rac2”
2009-09-30 11:45:07.621 [ohasd(3684)]CRS-2112:The OLR service started on node rac2. 2009-09-30 11:45:08.605 [ohasd(3684)]CRS-8011:reboot advisory message from host: rac2, component: ag232314, with time stamp: L-2009-09-29-23:48:27.701 [ohasd(3684)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS 2009-09-30 11:45:08.800 [ohasd(3684)]CRS-8011:reboot advisory message from host: rac2, component: mo220120, with time stamp: L-2009-09-27-21:06:00.250 [ohasd(3684)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS 2009-09-30 11:45:08.951 [ohasd(3684)]CRS-8011:reboot advisory message from host: rac2, component: ag220120, with time stamp: L-2009-09-27-21:06:00.246 [ohasd(3684)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS 2009-09-30 11:45:10.395 [ohasd(3684)]CRS-8011:reboot advisory message from host: rac2, component: ag210904, with time stamp: L-2009-09-28-23:19:04.300 [ohasd(3684)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS 2009-09-30 11:45:10.472 [ohasd(3684)]CRS-8017:location: /etc/oracle/lastgasp has 42 reboot advisory log files, 4 were announced and 0 errors occurred 2009-09-30 11:45:17.136 [ohasd(3684)]CRS-2772:Server 'rac2' has been assigned to pool 'Free'. 2009-09-30 11:45:22.547 [cssd(3949)]CRS-1713:CSSD daemon is started in clustered mode 2009-09-30 11:45:42.982 [cssd(3949)]CRS-1707:Lease acquisition for node rac2 number 2 completed 2009-09-30 11:45:43.152 [cssd(3949)]CRS-1605:CSSD voting file is online: ORCL:DISK002B; details in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log. 2009-09-30 11:45:43.195 [cssd(3949)]CRS-1605:CSSD voting file is online: ORCL:DISK002A; details in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log. 2009-09-30 11:45:43.264 [cssd(3949)]CRS-1605:CSSD voting file is online: ORCL:DISK001B; details in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
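Note that CSSD reports all voting files as online even though the interconnect is dead: the voting disks are accessed over the storage path, not over the private network. If you want to confirm voting-file access independently of the interconnect, CSS can be queried directly (as the grid owner or root):

# List the configured voting files and their state:
crsctl query css votedisk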
ocssd.log on node rac2
11:48:29.195: [CSSD][1300728128]clssnmSendingThread: sending join msg to all nodes
11:48:29.195: [CSSD][1300728128]clssnmSendingThread: sent 5 join msgs to all nodes
11:48:29.256: [CSSD][1164360000]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 148255197, wrtcnt, 1486324, LATS 4294342800, lastSeqNo 1486324, uniqueness 1254298526, timestamp 1254304108/4745544
11:48:29.386: [CSSD][1227299136]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 148255197, wrtcnt, 1486325, LATS 4294342930, lastSeqNo 1486325, uniqueness 1254298526, timestamp 1254304109/4746554
11:48:29.735: [CSSD][1311217984]clssnmRcfgMgrThread: Local Join
11:48:29.735: [CSSD][1311217984]clssnmLocalJoinEvent: begin on node(2), waittime 193000
11:48:29.735: [CSSD][1311217984]clssnmLocalJoinEvent: set curtime (4294343280) for my node
11:48:29.735: [CSSD][1311217984]clssnmLocalJoinEvent: scanning 32 nodes
11:48:29.735: [CSSD][1311217984]clssnmLocalJoinEvent: Node rac1, number 1, is in an existing cluster with disk state 3
11:48:29.735: [CSSD][1311217984]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
11:48:30.175: [CSSD][1279748416]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
11:48:30.266: [CSSD][1164360000]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 148255197, wrtcnt, 1486325, LATS 4294343810, lastSeqNo 1486325, uniqueness 1254298526, timestamp 1254304109/4746554
11:48:30.397: [CSSD][1227299136]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 148255197, wrtcnt, 1486326, LATS 4294343940, lastSeqNo 1486326, uniqueness 1254298526, timestamp 1254304110/4747564
11:48:31.185: [CSSD][1279748416]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
11:48:31.276: [CSSD][1164360000]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 148255197, wrtcnt, 1486326, LATS 4294344820, lastSeqNo 1486326, uniqueness 1254298526, timestamp 1254304110/4747564
11:48:31.406: [CSSD][1227299136]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 148255197, wrtcnt, 1486327, LATS 4294344950, lastSeqNo 1486327, uniqueness 1254298526, timestamp 1254304111/4748574
11:48:32.195: [CSSD][1279748416]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
11:48:32.288: [CSSD][1164360000]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 148255197, wrtcnt, 1486327, LATS 4294345830, lastSeqNo 1486327, uniqueness 1254298526, timestamp 1254304111/4748574
11:48:32.416: [CSSD][1227299136]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 148255197, wrtcnt, 1486328, LATS 4294345960, lastSeqNo 1486328, uniqueness 1254298526, timestamp 1254304112/4749584
11:48:33.205: [CSSD][1279748416]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
11:48:33.296: [CSSD][1164360000]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 148255197, wrtcnt, 1486328, LATS 4294346840, lastSeqNo 1486328, uniqueness 1254298526, timestamp 1254304112/4749584
11:48:33.426: [CSSD][1227299136]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 148255197, wrtcnt, 1486329, LATS 4294346970, lastSeqNo 1486329, uniqueness 1254298526, timestamp 1254304113/4750594
11:48:34.195: [CSSD][1300728128]clssnmSendingThread: sending join msg to all nodes
11:48:34.195: [CSSD][1300728128]clssnmSendingThread: sent 5 join msgs to all nodes
11:48:34.215: [CSSD][1279748416]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
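The key symptom in this excerpt is the repeated “has a disk HB, but no network HB” message: rac2 sees rac1’s heartbeat on the voting disks but receives no network heartbeat, so it cannot join the existing cluster. A quick way to spot (and quantify) this condition is to grep the ocssd.log; the path below is the one from this installation, so adjust the Grid home and hostname for your own system:

# Count how often CSSD saw a disk heartbeat without a network heartbeat:
grep -c "has a disk HB, but no network HB" /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log

# Show the most recent occurrences with their timestamps:
grep "has a disk HB, but no network HB" /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log | tail -5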
After waiting several minutes, node “rac2” reported:
2009-09-30 12:05:37.859 [cssd(4612)]CRS-1603:CSSD on node rac2 shutdown by user.
2009-09-30 12:05:37.858 [/u01/app/11.2.0/grid/bin/cssdagent(4596)]CRS-5818:Aborted command 'start for resource: ora.cssd 1 1' for resource 'ora.cssd'. Details at (:CRSAGF00113:) in /u01/app/11.2.0/grid/log/rac2/agent/ohasd/oracssdagent_root/oracssdagent_root.log.
2009-09-30 12:05:42.884 [ohasd(3684)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.cssd'. Details at (:CRSPE00111:) in /u01/app/11.2.0/grid/log/rac2/ohasd/ohasd.log.
2009-09-30 12:05:43.302 [ohasd(3684)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'rac2'.
2009-09-30 12:36:26.398 [ohasd(3684)]CRS-2765:Resource 'ora.crsd' has failed on server 'rac2'.
Although the log says “CSSD on node rac2 shutdown by user”, there was no manual intervention; CSSD was shut down by the clusterware stack after its start command timed out. Once crsd has failed, the system will not try to join the cluster again on its own.
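As a side note: once the interconnect has been repaired, a node in this state could also be brought back by restarting the clusterware stack manually instead of rebooting the server. This is only a sketch (run as root on the affected node; the Grid home path is the one used in this installation):

# Check what is (or is not) running on this node:
/u01/app/11.2.0/grid/bin/crsctl check cluster

# Restart the clusterware stack on this node only:
/u01/app/11.2.0/grid/bin/crsctl stop crs -f
/u01/app/11.2.0/grid/bin/crsctl start crs

# Watch the resources come back online:
/u01/app/11.2.0/grid/bin/crsctl stat res -t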
Starting up the system with private interconnect working again
After restoring the private network connection between the two nodes, the second node successfully joined the cluster again after a reboot.
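Before letting the node rejoin it is worth verifying that the private network really works again. A few checks that could be used; rac1-priv is a placeholder for the private hostname/address of node “rac1”, so substitute the names from your own configuration:

# Which interfaces does the clusterware use for the public and private network?
oifcfg getif

# Basic reachability over the private network (rac1-priv is a placeholder name):
ping -c 3 rac1-priv

# Let the cluster verification utility check node connectivity:
cluvfy comp nodecon -n rac1,rac2

# Confirm both nodes are active again:
olsnodes -n -s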
Hi Ronny,
I have run a test on 11.2.0.2 RAC where I disconnected the interconnect interface, and I found that after I enabled the 2nd node back it took ~8 min to restart. Have you seen the same behaviour in your tests? I don’t think it’s normal.
Thanks,
dani