In the first test we will simply power off one node under normal load. Let's see what the different components (clusterware, ASM instance, ADVM/ACFS) report:
Clusterware on node “rac2” – file “alertrac2.log”
2009-09-29 14:42:06.231 [cssd(3917)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.440 seconds
2009-09-29 14:42:13.301 [cssd(3917)]CRS-1611:Network communication with node rac1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 7.370 seconds
2009-09-29 14:42:18.350 [cssd(3917)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.320 seconds
2009-09-29 14:42:20.686 [cssd(3917)]CRS-1632:Node rac1 is being removed from the cluster in cluster incarnation 148255184
2009-09-29 14:42:20.811 [cssd(3917)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac2 .
2009-09-29 14:42:21.234 [ctssd(4001)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac2.
2009-09-29 14:42:29.202 [crsd(4386)]CRS-5504:Node down event reported for node 'rac1'.
2009-09-29 14:42:37.511 [crsd(4386)]CRS-2773:Server 'rac1' has been removed from pool 'Generic'.
2009-09-29 14:42:37.511 [crsd(4386)]CRS-2773:Server 'rac1' has been removed from pool 'ora.ora11p'.
This is no surprise: after turning off the first node ("rac1"), the clusterware reports the missing network heartbeat, removes the node from the cluster after 30 seconds, recovers the database and continues to run fine.
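To double-check the new cluster membership you do not have to rely on the alert log alone. A quick sketch, assuming the grid infrastructure home used in this setup (/u01/app/11.2.0/grid):

# Run on the surviving node "rac2" as the grid software owner.
# olsnodes -n -s lists every node with its number and Active/Inactive state;
# crsctl check cluster -all verifies the clusterware stack on all reachable nodes.
/u01/app/11.2.0/grid/bin/olsnodes -n -s
/u01/app/11.2.0/grid/bin/crsctl check cluster -all

After the eviction, "rac1" should show up as "Inactive" in the olsnodes output.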
ASM instance on node “rac2”
Tue Sep 29 14:42:21 2009
Reconfiguration started (old inc 100, new inc 102)
List of instances:
 2 (myinst: 2)
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
* dead instance detected - domain 2 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Tue Sep 29 14:42:22 2009
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Tue Sep 29 14:42:22 2009
kfdp_query(DATA2): 8
Tue Sep 29 14:42:22 2009
NOTE: PST enabling heartbeating (grp 2)
NOTE: PST enabling heartbeating (grp 1)
kfdp_queryBg(): 8
NOTE: SMON starting instance recovery for group DATA2 domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: F1X0 found on disk 1 au 2 fcn 0.0
NOTE: starting recovery of thread=2 ckpt=17.324 group=1 (DATA2)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 1 (DATA2)
NOTE: advancing ckpt for thread=2 ckpt=17.324
NOTE: SMON did instance recovery for group DATA2 domain 1
Tue Sep 29 14:42:22 2009
NOTE: waiting for instance recovery of group 2
NOTE: waiting for instance recovery of group 2
kfdp_query(DATA): 9
kfdp_queryBg(): 9
NOTE: SMON starting instance recovery for group DATA domain 2 (mounted)
NOTE: F1X0 found on disk 1 au 2 fcn 0.831
NOTE: F1X0 found on disk 2 au 240 fcn 0.814
NOTE: F1X0 found on disk 3 au 2 fcn 0.814
NOTE: starting recovery of thread=2 ckpt=18.342 group=2 (DATA)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 2 (DATA)
NOTE: advancing ckpt for thread=2 ckpt=18.342
NOTE: SMON did instance recovery for group DATA domain 2
ASM recovers both disk groups as well.
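If you want to confirm this from SQL rather than the alert log, the disk group state can be queried on the surviving ASM instance. A minimal sketch, assuming ORACLE_HOME and ORACLE_SID point at the local +ASM2 instance:

# STATE should be MOUNTED and OFFLINE_DISKS should drop back to 0
# once instance recovery is complete.
sqlplus -s / as sysasm <<'EOF'
select name, state, type, offline_disks from v$asm_diskgroup;
EOF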
ADVM messages on node “rac2”
[Oracle OKS] Cluster Membership change - Current incarn 0x60
[Oracle OKS] Nodes in cluster:
[Oracle OKS]  Node 1 (IP 0xc0a8b50a) Node 2 (IP 0xc0a8b514)
[Oracle OKS] Node count 2
[Oracle OKS] Cluster Membership change - Current incarn 0x61
[Oracle OKS] Nodes in cluster:
[Oracle OKS]  Node 2 (IP 0xc0a8b514)
[Oracle OKS] Node count 1
[Oracle ADVM] Cluster reconfiguration started.
[Oracle ADVM] Cluster reconfiguration completed.
[Oracle ADVM] Cluster reconfiguration completed.
[Oracle OKS] Cluster Membership change setup complete
[Oracle ADVM] Mirror recovery for volume DATA2-ORA11P_HOME started.
[Oracle ADVM] Mirror recovery for volume DATA2-ORA11P_HOME completed.
ADVM recovers the mirror for the ACFS volume as well.
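The same can be verified for the volume itself. A minimal sketch against v$asm_volume, again assuming a sysasm login on the surviving node:

# STATE returns to ENABLED once mirror recovery is done; MOUNTPATH shows
# where the ACFS file system on top of the volume is mounted.
sqlplus -s / as sysasm <<'EOF'
select volume_name, state, usage, mountpath from v$asm_volume;
EOF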
Restarting the node
Clusterware
Alertrac2.log
[cssd(3917)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2009-09-29 14:59:34.214 [crsd(4386)]CRS-2772:Server 'rac1' has been assigned to pool 'Generic'.
2009-09-29 14:59:34.214 [crsd(4386)]CRS-2772:Server 'rac1' has been assigned to pool 'ora.ora11p'.
Alertrac1.log
2009-09-29 14:58:03.638 [ohasd(3700)]CRS-2112:The OLR service started on node rac1.
2009-09-29 14:58:04.085 [ohasd(3700)]CRS-8011:reboot advisory message from host: rac1, component: mo205532, with time stamp: L-2009-09-28-23:02:37.992
[ohasd(3700)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2009-09-29 14:58:04.605 [ohasd(3700)]CRS-8011:reboot advisory message from host: rac1, component: ag205532, with time stamp: L-2009-09-28-23:02:37.991
[ohasd(3700)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2009-09-29 14:58:04.649 [ohasd(3700)]CRS-8017:location: /etc/oracle/lastgasp has 38 reboot advisory log files, 6 were announced and 0 errors occurred
2009-09-29 14:58:12.866 [ohasd(3700)]CRS-2772:Server 'rac1' has been assigned to pool 'Free'.
2009-09-29 14:58:15.711 [ohasd(3700)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2009-09-29 14:58:18.554 [cssd(3933)]CRS-1713:CSSD daemon is started in clustered mode
2009-09-29 14:58:39.604 [cssd(3933)]CRS-1707:Lease acquisition for node rac1 number 1 completed
2009-09-29 14:58:39.838 [cssd(3933)]CRS-1605:CSSD voting file is online: ORCL:DISK002B; details in /u01/app/11.2.0/grid/log/rac1/cssd/ocssd.log.
2009-09-29 14:58:39.887 [cssd(3933)]CRS-1605:CSSD voting file is online: ORCL:DISK002A; details in /u01/app/11.2.0/grid/log/rac1/cssd/ocssd.log.
2009-09-29 14:58:39.963 [cssd(3933)]CRS-1605:CSSD voting file is online: ORCL:DISK001B; details in /u01/app/11.2.0/grid/log/rac1/cssd/ocssd.log.
2009-09-29 14:58:50.910 [cssd(3933)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2009-09-29 14:58:51.915 [ctssd(4019)]CRS-2403:The Cluster Time Synchronization Service on host rac1 is in observer mode.
2009-09-29 14:58:51.940 [ctssd(4019)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac2.
2009-09-29 14:58:52.559 [ctssd(4019)]CRS-2401:The Cluster Time Synchronization Service started on host rac1.
[client(4179)]CRS-10001:ACFS-9327: Verifying ADVM/ACFS devices.
[client(4209)]CRS-10001:ACFS-9322: done.
2009-09-29 14:59:26.797 [crsd(4378)]CRS-1012:The OCR service started on node rac1.
2009-09-29 14:59:30.094 [crsd(4378)]CRS-1201:CRSD started on node rac1.
[client(4676)]CRS-10001:ACFS-9257: Mounting /dev/asm/ora11p_home-132 on /u01/app/oracle/product/11.2.0/ora11p.
[client(4732)]CRS-10001:ACFS-9257: Mounting /dev/asm/ora11p_home-132 on /u01/app/oracle/product/11.2.0/ora11p.
[client(4762)]CRS-10001:ACFS-9258: Failed to mount /dev/asm/ora11p_home-132 on /u01/app/oracle/product/11.2.0/ora11p.
2009-09-29 14:59:44.096 [/u01/app/11.2.0/grid/bin/orarootagent.bin(4513)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/acfssinglefsmount" spawned by agent "/u01/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/rac1/agent/crsd/orarootagent_root/orarootagent_root.log"
The clusterware reports problems mounting the ACFS file system on the restarted node. The log file "orarootagent_root.log" explains why:
2009-09-29 14:59:44.096: [ora.data2.ora11p_home.acfs][1511520576] [start] (:CLSN00010:)mount: according to mtab, /dev/asm/ora11p_home-132 is already mounted on /u01/app/oracle/product/11.2.0/ora11p
2009-09-29 14:59:44.096: [ora.data2.ora11p_home.acfs][1511520576] [start] (:CLSN00010:)acfssinglefsmount: ACFS-9258: Failed to mount /dev/asm/ora11p_home-132 on /u01/app/oracle/product/11.2.0/ora11p.
2009-09-29 14:59:44.096: [ora.data2.ora11p_home.acfs][1511520576] [start] Exception type=2 string=ACFS-9257: Mounting /dev/asm/ora11p_home-132 on /u01/app/oracle/product/11.2.0/ora11p.
mount: /dev/asm/ora11p_home-132 already mounted or /u01/app/oracle/product/11.2.0/ora11p busy
The explanation for this behavior: the file /etc/mtab contains a list of all file systems mounted on the host. Because we suddenly turned off the power supply, the mtab file was not cleaned up properly. After the reboot, the clusterware tries to mount the volume "/dev/asm/ora11p_home-132" by issuing a mount command just as you would normally do. mount itself checks /etc/mtab for already mounted file systems and reports that the file system is already mounted. Despite this error, Oracle nevertheless succeeds in mounting the ACFS file system.
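You can see the discrepancy directly on the restarted node. A small sketch (the grep pattern simply matches the volume name used in this setup):

# /etc/mtab is bookkeeping written by the mount utility in user space and
# survives a hard power-off with stale entries; /proc/mounts is the kernel's
# authoritative view of what is really mounted.
grep ora11p_home /etc/mtab       # may still show the volume from before the crash
grep ora11p_home /proc/mounts    # only lists the volume once it is really mounted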
ASM messages
Node “rac2”
Tue Sep 29 14:59:13 2009
Reconfiguration started (old inc 102, new inc 104)
List of instances:
 1 2 (myinst: 2)
Global Resource Directory frozen
Communication channels reestablished
Tue Sep 29 14:59:13 2009
* domain 0 valid = 1 according to instance 1
* domain 1 valid = 1 according to instance 1
* domain 2 valid = 1 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Tue Sep 29 14:59:13 2009
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Node “rac1”
Tue Sep 29 14:59:05 2009
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 bond1 192.168.181.0 configured from GPnP Profile for use as a cluster interconnect
Interface type 1 bond0 172.23.0.0 configured from GPnP Profile for use as a public interface
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/11.2.0/grid/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NOTE: Volume support enabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
Using parameter settings in server-side spfile +DATA/rac/asmparameterfile/registry.253.697417189
System parameters with non-default values:
  large_pool_size          = 12M
  instance_type            = "asm"
  remote_login_passwordfile= "EXCLUSIVE"
  asm_diskgroups           = "DATA2"
  asm_power_limit          = 1
  diagnostic_dest          = "/u01/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
  192.168.181.10
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Tue Sep 29 14:59:08 2009
PMON started with pid=2, OS id=4294
Tue Sep 29 14:59:08 2009
VKTM started with pid=3, OS id=4296 at elevated priority
VKTM running at (10)millisec precision with DBRM quantum (100)ms
Tue Sep 29 14:59:08 2009
GEN0 started with pid=4, OS id=4300
Tue Sep 29 14:59:09 2009
DIAG started with pid=5, OS id=4302
Tue Sep 29 14:59:09 2009
PING started with pid=6, OS id=4304
Tue Sep 29 14:59:09 2009
PSP0 started with pid=7, OS id=4306
Tue Sep 29 14:59:09 2009
DIA0 started with pid=8, OS id=4308
Tue Sep 29 14:59:09 2009
LMON started with pid=9, OS id=4310
Tue Sep 29 14:59:11 2009
LMD0 started with pid=10, OS id=4312
Tue Sep 29 14:59:11 2009
LMS0 started with pid=11, OS id=4316 at elevated priority
Tue Sep 29 14:59:11 2009
LMHB started with pid=12, OS id=4320
Tue Sep 29 14:59:12 2009
MMAN started with pid=13, OS id=4322
Tue Sep 29 14:59:12 2009
DBW0 started with pid=14, OS id=4324
Tue Sep 29 14:59:12 2009
LGWR started with pid=15, OS id=4326
Tue Sep 29 14:59:12 2009
CKPT started with pid=16, OS id=4328
Tue Sep 29 14:59:12 2009
RBAL started with pid=18, OS id=4332
Tue Sep 29 14:59:12 2009
GMON started with pid=19, OS id=4334
Tue Sep 29 14:59:12 2009
SMON started with pid=17, OS id=4330
Tue Sep 29 14:59:12 2009
MMON started with pid=20, OS id=4336
Tue Sep 29 14:59:12 2009
MMNL started with pid=21, OS id=4338
lmon registered with NM - instance number 1 (internal mem no 0)
Reconfiguration started (old inc 0, new inc 104)
ASM instance
List of instances:
 1 2 (myinst: 1)
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
* allocate domain 1, invalid = TRUE
* allocate domain 2, invalid = TRUE
* domain 0 valid = 1 according to instance 2
* domain 1 valid = 1 according to instance 2
* domain 2 valid = 1 according to instance 2
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Tue Sep 29 14:59:14 2009
LCK0 started with pid=22, OS id=4344
Tue Sep 29 14:59:17 2009
SQL> ALTER DISKGROUP ALL MOUNT /* asm agent */
Diskgroup with spfile:DATA
Diskgroup used for OCR is:DATA
NOTE: cache registered group DATA2 number=1 incarn=0x454864be
NOTE: cache began mount (not first) of group DATA2 number=1 incarn=0x454864be
NOTE: cache registered group DATA number=2 incarn=0x458864bf
NOTE: cache began mount (not first) of group DATA number=2 incarn=0x458864bf
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Assigning number (1,0) to disk (ORCL:DISK003A)
NOTE: Assigning number (1,1) to disk (ORCL:DISK003B)
NOTE: Assigning number (2,0) to disk (ORCL:DISK001A)
NOTE: Assigning number (2,1) to disk (ORCL:DISK001B)
NOTE: Assigning number (2,2) to disk (ORCL:DISK002A)
NOTE: Assigning number (2,3) to disk (ORCL:DISK002B)
kfdp_query(DATA2): 3
kfdp_queryBg(): 3
NOTE: cache opening disk 0 of grp 1: DISK003A label:DISK003A
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DISK003B label:DISK003B
NOTE: F1X0 found on disk 1 au 2 fcn 0.0
NOTE: cache mounting (not first) normal redundancy group 1/0x454864BE (DATA2)
kjbdomatt send to inst 2
NOTE: attached to recovery domain 1
NOTE: LGWR attempting to mount thread 2 for diskgroup 1 (DATA2)
NOTE: LGWR found thread 2 closed at ABA 17.323
NOTE: LGWR mounted thread 2 for diskgroup 1 (DATA2)
NOTE: LGWR opening thread 2 at fcn 0.2638 ABA 18.324
NOTE: cache mounting group 1/0x454864BE (DATA2) succeeded
NOTE: cache ending mount (success) of group DATA2 number=1 incarn=0x454864be
kfdp_query(DATA): 4
kfdp_queryBg(): 4
kfdp_query(DATA): 5
kfdp_queryBg(): 5
NOTE: cache opening disk 1 of grp 2: DISK001B label:DISK001B
NOTE: F1X0 found on disk 1 au 2 fcn 0.831
NOTE: cache opening disk 2 of grp 2: DISK002A label:DISK002A
NOTE: F1X0 found on disk 2 au 240 fcn 0.814
NOTE: cache opening disk 3 of grp 2: DISK002B label:DISK002B
NOTE: F1X0 found on disk 3 au 2 fcn 0.814
NOTE: cache mounting (not first) normal redundancy group 2/0x458864BF (DATA)
kjbdomatt send to inst 2
NOTE: attached to recovery domain 2
NOTE: LGWR attempting to mount thread 2 for diskgroup 2 (DATA)
NOTE: LGWR found thread 2 closed at ABA 18.341
NOTE: LGWR mounted thread 2 for diskgroup 2 (DATA)
NOTE: LGWR opening thread 2 at fcn 0.4042 ABA 19.342
NOTE: cache mounting group 2/0x458864BF (DATA) succeeded
NOTE: cache ending mount (success) of group DATA number=2 incarn=0x458864bf
kfdp_query(DATA2): 6
kfdp_queryBg(): 6
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup DATA2 was mounted
NOTE: De-assigning number (2,0) from disk (ORCL:DISK001A)
kfdp_query(DATA): 7
kfdp_queryBg(): 7
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 2
SUCCESS: diskgroup DATA was mounted
SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent */
SQL> ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent */
Starting background process VBG0
Starting background process ACFS
Tue Sep 29 14:59:20 2009
ACFS started with pid=25, OS id=4359
Tue Sep 29 14:59:20 2009
VBG0 started with pid=24, OS id=4357
SUCCESS: ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent */
Tue Sep 29 14:59:25 2009
Starting background process ASMB
Tue Sep 29 14:59:25 2009
ASMB started with pid=26, OS id=4393
Tue Sep 29 14:59:40 2009
Starting background process VDBG
Tue Sep 29 14:59:40 2009
VDBG started with pid=32, OS id=4713
Starting background process VMB0
Tue Sep 29 14:59:40 2009
VMB0 started with pid=33, OS id=4718
Starting background process VBG1
Starting background process VBG2
Tue Sep 29 14:59:43 2009
VBG2 started with pid=35, OS id=4752
Tue Sep 29 14:59:43 2009
VBG1 started with pid=34, OS id=4750
Tue Sep 29 15:00:33 2009
ALTER SYSTEM SET local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=rac1-vip)(PORT=1521))))' SCOPE=MEMORY SID='+ASM1';
ADVM messages
Node “rac2”
[Oracle OKS] Cluster Membership change - Current incarn 0x64
[Oracle OKS] Nodes in cluster:
[Oracle OKS]  Node 1 (IP 0xc0a8b50a) Node 2 (IP 0xc0a8b514)
[Oracle OKS] Node count 2
[Oracle ADVM] Cluster reconfiguration started.
[Oracle ADVM] Cluster reconfiguration completed.
[Oracle ADVM] Cluster reconfiguration completed.
[Oracle OKS] Cluster Membership change setup complete
Node “rac1”
[Oracle OKS] mallocing log buffer, size=10485760
[Oracle OKS] log buffer = 0xffffc20000006000
[Oracle OKS] Module load succeeded. Build information: (LOW DEBUG) - (USM_11.2.0.1.0_LINUX.X64_090715.1), Built on Aug 2 2009
oracleadvm: no version for "KsRwLockReadTry" found: kernel tainted.
[Oracle ADVM] Module load succeeded. Build information: (LOW DEBUG) - USM_11.2.0.1.0_LINUX.X64_090715.1 Built on Aug 2 2009
[Oracle ACFS] Module load succeeded. Build information: (LOW DEBUG), (USM_11.2.0.1.0_LINUX.X64_090715.1), Built on Aug 2 2009
[Oracle OKS] Cluster Membership change - Current incarn 0x64
[Oracle OKS] Nodes in cluster:
[Oracle OKS]  Node 1 (IP 0xc0a8b50a) Node 2 (IP 0xc0a8b514)
[Oracle OKS] Node count 2
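To confirm that all three kernel modules came back after the reboot, they can be checked with lsmod (module names as shipped with 11.2.0.1 on Linux):

# oracleoks = kernel services, oracleadvm = volume driver, oracleacfs = file system
lsmod | egrep 'oracleoks|oracleadvm|oracleacfs'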
Conclusion
Suddenly turning off the power seems to be no problem for the cluster: even after more than twenty tests the whole cluster started successfully every time. Quite good :-)