ASM resilvering – or – how to recover your crashed cluster – Test no 2

In this post we will cut the private network interconnect between the two nodes by moving the LAN interfaces of node “rac1” to a different VLAN. The network ports therefore stay physically connected (link up), but connectivity between the nodes is interrupted.
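
A quick way to confirm that the interconnect is really dead while the link stays up is to ask Grid Infrastructure which interface it uses as the cluster interconnect and then to ping the remote private address over exactly that interface. The interface names and addresses below are just an example of a typical lab setup, not taken from this cluster:

[grid@rac1 ~]$ oifcfg getif                    # interface names and subnets below are made up
eth0  192.168.56.0  global  public
eth1  10.10.10.0    global  cluster_interconnect
[grid@rac1 ~]$ ping -c 3 -I eth1 10.10.10.2    # private IP of rac2 in this example; should fail now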

Terminating the network connection

After the network connection between the two nodes was cut, node “rac2” rebooted to preserve cluster integrity. The log files on both nodes document this clearly:

alertrac1.log on node “rac1”

2009-09-30 10:20:12.903
[cssd(4178)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2009-09-30 10:21:14.920
[crsd(4454)]CRS-2772:Server 'rac2' has been assigned to pool 'Generic'.
2009-09-30 10:21:14.921
[crsd(4454)]CRS-2772:Server 'rac2' has been assigned to pool 'ora.ora11p'.
2009-09-30 11:42:01.860
[cssd(4178)]CRS-1612:Network communication with node rac2 (2) missing
for 50% of timeout
interval.  Removal of this node from cluster in 14.170 seconds
2009-09-30 11:42:08.930
[cssd(4178)]CRS-1611:Network communication with node rac2 (2) missing
for 75% of timeout
interval.  Removal of this node from cluster in 7.100 seconds
2009-09-30 11:42:13.977
[cssd(4178)]CRS-1610:Network communication with node rac2 (2) missing
for 90% of timeout
interval.  Removal of this node from cluster in 2.050 seconds
2009-09-30 11:42:16.044
[cssd(4178)]CRS-1607:Node rac2 is being evicted in cluster
incarnation 148255196; details
at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/rac1/cssd/ocssd.log.
2009-09-30 11:42:46.735
[cssd(4178)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 .
2009-09-30 11:42:46.763
[crsd(4454)]CRS-5504:Node down event reported for node 'rac2'.
2009-09-30 11:43:04.524
[ohasd(3698)]CRS-8011:reboot advisory message from host: rac2,
component: mo102000,
with time stamp: L-2009-09-30-11:42:33.374
[ohasd(3698)]CRS-8013:reboot advisory message text:
clsnomon_status: need to reboot,
unexpected failure 8 received from CSS
2009-09-30 11:43:09.029
[crsd(4454)]CRS-2773:Server 'rac2' has been removed from pool 'Generic'.
2009-09-30 11:43:09.029
[crsd(4454)]CRS-2773:Server 'rac2' has been removed from pool 'ora.ora11p'.

alertrac2.log on node “rac2”

2009-09-30 11:42:01.822
[cssd(3914)]CRS-1612:Network communication with node rac1 (1) missing
for 50% of timeout
interval.  Removal of this node from cluster in 14.120 seconds
2009-09-30 11:42:08.891
[cssd(3914)]CRS-1611:Network communication with node rac1 (1) missing
for 75% of timeout
interval.  Removal of this node from cluster in 7.050 seconds
2009-09-30 11:42:13.941
[cssd(3914)]CRS-1610:Network communication with node rac1 (1) missing
for 90% of timeout
interval.  Removal of this node from cluster in 2.000 seconds
2009-09-30 11:42:15.956
[cssd(3914)]CRS-1609:This node is unable to communicate with other
nodes in the cluster
and is going down to preserve cluster integrity; details at (:CSSNM00008:)
in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
2009-09-30 11:42:16.432
[cssd(3914)]CRS-1608:This node was evicted by node 1, rac1;
details at (:CSSNM00005:)
in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
2009-09-30 11:42:16.771
[cssd(3914)]CRS-1608:This node was evicted by node 1, rac1;
details at (:CSSNM00005:)
in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
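
The countdown in both alert logs (removal in roughly 14 seconds at the 50% mark, 7 seconds at 75% and 2 seconds at 90%) is driven by the CSS misscount, the network heartbeat timeout, which defaults to 30 seconds in 11.2. You can check the value configured for your own cluster with:

[grid@rac1 ~]$ crsctl get css misscount        # should report 30 on a default 11.2 installation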

Starting up the system with private network communication still impossible

Node “rac2” was forced to restart to preserve cluster integrity. Note that communication between the two nodes is still impossible: the private network interfaces of the two nodes are still in separate VLANs.
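
From the surviving node this state is easy to verify; after the eviction “rac1” should be the only active cluster member. The output below is what I would expect to see, not a capture from this test:

[grid@rac1 ~]$ olsnodes -n -s                  # node name, node number and status
rac1    1       Active
rac2    2       Inactive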

While the surviving node “rac1” does not report any problems in either its clusterware alert log or its ocssd.log, the rebooted node “rac2” reports plenty:

alertrac2.log on node “rac2”

2009-09-30 11:45:07.621
[ohasd(3684)]CRS-2112:The OLR service started on node rac2.
2009-09-30 11:45:08.605
[ohasd(3684)]CRS-8011:reboot advisory message from host: rac2,
component: ag232314,
with time stamp: L-2009-09-29-23:48:27.701
[ohasd(3684)]CRS-8013:reboot advisory message text: clsnomon_status:
need to reboot,
 unexpected failure 8 received from CSS
2009-09-30 11:45:08.800
[ohasd(3684)]CRS-8011:reboot advisory message from host: rac2,
component: mo220120,
with time stamp: L-2009-09-27-21:06:00.250
[ohasd(3684)]CRS-8013:reboot advisory message text: clsnomon_status:
need to reboot,
unexpected failure 8 received from CSS
2009-09-30 11:45:08.951
[ohasd(3684)]CRS-8011:reboot advisory message from host: rac2,
component: ag220120,
with time stamp: L-2009-09-27-21:06:00.246
[ohasd(3684)]CRS-8013:reboot advisory message text: clsnomon_status:
need to reboot,
unexpected failure 8 received from CSS
2009-09-30 11:45:10.395
[ohasd(3684)]CRS-8011:reboot advisory message from host: rac2,
component: ag210904,
with time stamp: L-2009-09-28-23:19:04.300
[ohasd(3684)]CRS-8013:reboot advisory message text: clsnomon_status:
need to reboot,
unexpected failure 8 received from CSS
2009-09-30 11:45:10.472
[ohasd(3684)]CRS-8017:location: /etc/oracle/lastgasp has 42 reboot
advisory log files,
4 were announced and 0 errors occurred
2009-09-30 11:45:17.136
[ohasd(3684)]CRS-2772:Server 'rac2' has been assigned to pool 'Free'.
2009-09-30 11:45:22.547
[cssd(3949)]CRS-1713:CSSD daemon is started in clustered mode
2009-09-30 11:45:42.982
[cssd(3949)]CRS-1707:Lease acquisition for node rac2 number 2 completed
2009-09-30 11:45:43.152
[cssd(3949)]CRS-1605:CSSD voting file is online: ORCL:DISK002B; details in
/u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
2009-09-30 11:45:43.195
[cssd(3949)]CRS-1605:CSSD voting file is online: ORCL:DISK002A; details in
/u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
2009-09-30 11:45:43.264
[cssd(3949)]CRS-1605:CSSD voting file is online: ORCL:DISK001B; details in
/u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
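
The CRS-8011/CRS-8013 blocks are not new failures: during startup ohasd simply replays “reboot advisory” messages that were written to local disk shortly before earlier reboots, and the CRS-8017 line tells you where they are kept. You can have a look at that directory yourself; I am only showing the command here since the file names vary:

[root@rac2 ~]# ls -lt /etc/oracle/lastgasp | head    # the reboot advisory ("last gasp") files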

ocssd.log on node rac2

11:48:29.195: [CSSD][1300728128]clssnmSendingThread: sending
join msg to all nodes
11:48:29.195: [CSSD][1300728128]clssnmSendingThread: sent
5 join msgs to all nodes
11:48:29.256: [CSSD][1164360000]clssnmvDHBValidateNCopy: node 1, rac1, has a
disk HB, but no network HB,
DHB has rcfg 148255197, wrtcnt, 1486324, LATS 4294342800, lastSeqNo 1486324,
uniqueness 1254298526, timestamp 1254304108/4745544
11:48:29.386: [CSSD][1227299136]clssnmvDHBValidateNCopy: node 1, rac1, has a
disk HB, but no network HB,
DHB has rcfg 148255197, wrtcnt, 1486325, LATS 4294342930, lastSeqNo 1486325,
uniqueness 1254298526, timestamp 1254304109/4746554
11:48:29.735: [CSSD][1311217984]clssnmRcfgMgrThread: Local Join
11:48:29.735: [CSSD][1311217984]clssnmLocalJoinEvent: begin on node(2),
waittime 193000
11:48:29.735: [CSSD][1311217984]clssnmLocalJoinEvent: set curtime
(4294343280) for my node
11:48:29.735: [CSSD][1311217984]clssnmLocalJoinEvent: scanning 32 nodes
11:48:29.735: [CSSD][1311217984]clssnmLocalJoinEvent: Node rac1, number 1, is
in an existing cluster with disk state 3
11:48:29.735: [CSSD][1311217984]clssnmLocalJoinEvent: takeover aborted due to
cluster member node found on disk
11:48:30.175: [CSSD][1279748416]clssgmWaitOnEventValue: after CmInfo State 
val 3, eval 1 waited 0
11:48:30.266: [CSSD][1164360000]clssnmvDHBValidateNCopy: node 1, rac1, has a
disk HB, but no network HB,
DHB has rcfg 148255197, wrtcnt, 1486325, LATS 4294343810, lastSeqNo 1486325,
uniqueness 1254298526, timestamp 1254304109/4746554
11:48:30.397: [CSSD][1227299136]clssnmvDHBValidateNCopy: node 1, rac1, has a
disk HB, but no network HB,
DHB has rcfg 148255197, wrtcnt, 1486326, LATS 4294343940, lastSeqNo 1486326,
uniqueness 1254298526, timestamp 1254304110/4747564
11:48:31.185: [CSSD][1279748416]clssgmWaitOnEventValue: after CmInfo State 
val 3, eval 1 waited 0
11:48:31.276: [CSSD][1164360000]clssnmvDHBValidateNCopy: node 1, rac1, has a
disk HB, but no network HB,
DHB has rcfg 148255197, wrtcnt, 1486326, LATS 4294344820, lastSeqNo 1486326,
uniqueness 1254298526, timestamp 1254304110/4747564
11:48:31.406: [CSSD][1227299136]clssnmvDHBValidateNCopy: node 1, rac1, has a
disk HB, but no network HB,
DHB has rcfg 148255197, wrtcnt, 1486327, LATS 4294344950, lastSeqNo 1486327,
uniqueness 1254298526, timestamp 1254304111/4748574
11:48:32.195: [CSSD][1279748416]clssgmWaitOnEventValue: after CmInfo State 
val 3, eval 1 waited 0
11:48:32.288: [CSSD][1164360000]clssnmvDHBValidateNCopy: node 1, rac1, has a
disk HB, but no network HB,
DHB has rcfg 148255197, wrtcnt, 1486327, LATS 4294345830, lastSeqNo 1486327,
uniqueness 1254298526, timestamp 1254304111/4748574
11:48:32.416: [CSSD][1227299136]clssnmvDHBValidateNCopy: node 1, rac1, has a
 disk HB, but no network HB,
DHB has rcfg 148255197, wrtcnt, 1486328, LATS 4294345960, lastSeqNo 1486328,
uniqueness 1254298526, timestamp 1254304112/4749584
11:48:33.205: [CSSD][1279748416]clssgmWaitOnEventValue: after CmInfo State 
val 3, eval 1 waited 0
11:48:33.296: [CSSD][1164360000]clssnmvDHBValidateNCopy: node 1, rac1, has a
disk HB, but no network HB,
 DHB has rcfg 148255197, wrtcnt, 1486328, LATS 4294346840, lastSeqNo 1486328,
uniqueness 1254298526, timestamp 1254304112/4749584
11:48:33.426: [CSSD][1227299136]clssnmvDHBValidateNCopy: node 1, rac1, has a
disk HB, but no network HB,
DHB has rcfg 148255197, wrtcnt, 1486329, LATS 4294346970, lastSeqNo 1486329,
uniqueness 1254298526, timestamp 1254304113/4750594
11:48:34.195: [CSSD][1300728128]clssnmSendingThread: sending join msg
to all nodes
11:48:34.195: [CSSD][1300728128]clssnmSendingThread: sent 5 join msgs
to all nodes
11:48:34.215: [CSSD][1279748416]clssgmWaitOnEventValue: after CmInfo State 
val 3, eval 1 waited 0
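
The interesting lines are “has a disk HB, but no network HB” and “takeover aborted due to cluster member node found on disk”: “rac2” can see the disk heartbeat of “rac1” on the voting files, so it knows an active cluster exists, but without a network heartbeat it can neither join that cluster nor take it over. If you want to see which voting files the node is polling, you can ask CSS directly (as the CRS-1605 messages above show, CSSD on “rac2” was still able to bring them online):

[grid@rac2 ~]$ crsctl query css votedisk       # should list ORCL:DISK001B, DISK002A and DISK002B as ONLINE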

After several minutes of waiting, node “rac2” reported:

2009-09-30 12:05:37.859
[cssd(4612)]CRS-1603:CSSD on node rac2 shutdown by user.
2009-09-30 12:05:37.858
[/u01/app/11.2.0/grid/bin/cssdagent(4596)]CRS-5818:Aborted command 'start
for resource: ora.cssd 1 1' for resource 'ora.cssd'. Details at
(:CRSAGF00113:) in
/u01/app/11.2.0/grid/log/rac2/agent/ohasd/oracssdagent_root/oracssdagent_root.log.
2009-09-30 12:05:42.884
[ohasd(3684)]CRS-2757:Command 'Start' timed out waiting for response from
the resource 'ora.cssd'. Details at (:CRSPE00111:) in
/u01/app/11.2.0/grid/log/rac2/ohasd/ohasd.log.
2009-09-30 12:05:43.302
[ohasd(3684)]CRS-2765:Resource 'ora.cssdmonitor' has failed on
server 'rac2'.
2009-09-30 12:36:26.398
[ohasd(3684)]CRS-2765:Resource 'ora.crsd' has failed on
server 'rac2'.

Although the log says “CSSD on node rac2 shutdown by user”, there was no manual intervention: CSSD was shut down by the clusterware stack itself after the start of resource ora.cssd timed out. Once crsd has failed, the node will not try to join the cluster again on its own.
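
In this test I restored the VLAN configuration and then rebooted the node (see the next section), but once the interconnect is back the stack can also be restarted by hand without a full reboot, along these lines (run as root):

[root@rac2 ~]# crsctl stop crs -f              # clean up whatever is left half-started
[root@rac2 ~]# crsctl start crs                # start the complete clusterware stack again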

Starting up the system with the private interconnect working again

After restoring the private interconnect connectivity between the two nodes, the second node successfully joined the cluster after a reboot.
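
A quick sanity check after the node is back: the output of “crsctl check cluster -all” should look roughly like this, with all three daemons online on every node:

[grid@rac1 ~]$ crsctl check cluster -all
rac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online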


2 Responses to ASM resilvering – or – how to recover your crashed cluster – Test no 2

  1. Pingback: ASM resilvering – or – how to recover your crashed cluster « Ronny Egner's Blog

  2. DanyC says:

    Hi Ronny,

    I have run a test on 11.2.0.2 RAC where I disconnected the interconnect interface, and I found that after I re-enabled the second node it took ~8 min to restart. Have you seen the same behaviour in your tests? I don’t think it’s normal.

    Thanks,
    dani
