ASM resilvering – or – how to recover your crashed cluster – Test no 1

In the first test we will simply power off one node under normal load. Let's see what the different components (Clusterware, ASM instance, ADVM/ACFS) report:

Clusterware on node “rac2” – file “alertrac2.log”

2009-09-29 14:42:06.231
[cssd(3917)]CRS-1612:Network communication with node rac1 (1) missing
for 50% of timeout interval.  Removal of this node from cluster
in 14.440 seconds
2009-09-29 14:42:13.301
[cssd(3917)]CRS-1611:Network communication with node rac1 (1) missing
for 75% of timeout interval.  Removal of this node from cluster
in 7.370 seconds
2009-09-29 14:42:18.350
[cssd(3917)]CRS-1610:Network communication with node rac1 (1) missing
for 90% of timeout interval.  Removal of this node from cluster
in 2.320 seconds
2009-09-29 14:42:20.686
[cssd(3917)]CRS-1632:Node rac1 is being removed from the cluster
in cluster incarnation 148255184
2009-09-29 14:42:20.811
[cssd(3917)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac2 .
2009-09-29 14:42:21.234
[ctssd(4001)]CRS-2407:The new Cluster Time Synchronization Service
reference node is host rac2.
2009-09-29 14:42:29.202
[crsd(4386)]CRS-5504:Node down event reported for node 'rac1'.
2009-09-29 14:42:37.511
[crsd(4386)]CRS-2773:Server 'rac1' has been removed from pool 'Generic'.
2009-09-29 14:42:37.511
[crsd(4386)]CRS-2773:Server 'rac1' has been removed from pool 'ora.ora11p'.

This comes as no surprise: after turning off the first node (named "rac1"), the Clusterware reports the missing network heartbeat, removes the node from the cluster after 30 seconds, recovers the database, and continues to run fine.
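The three CRS warnings in the log correspond to the documented 50%, 75% and 90% thresholds of the CSS misscount interval. A minimal sketch, assuming the 11.2 default misscount of 30 seconds (verifiable on a real cluster with `crsctl get css misscount`), shows how the "removal in N seconds" countdown is derived; the logged values (14.44s, 7.37s, 2.32s) differ slightly because the messages are emitted a moment after each threshold is crossed:

```python
# Reproduce the CSS eviction countdown seen in the log above.
# MISSCOUNT = 30 is the 11.2 default; check with `crsctl get css misscount`.
MISSCOUNT = 30  # seconds of missed network heartbeats before node eviction

def removal_countdown(pct_elapsed):
    """Seconds left until node removal once pct_elapsed of misscount has passed."""
    return MISSCOUNT * (1 - pct_elapsed)

# CRS-1612/1611/1610 fire at 50%/75%/90% of the misscount interval.
for pct, msg in [(0.50, "CRS-1612"), (0.75, "CRS-1611"), (0.90, "CRS-1610")]:
    print(f"{msg}: removal of this node from cluster in {removal_countdown(pct):.2f} seconds")
```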

ASM instance on node “rac2”

Tue Sep 29 14:42:21 2009
Reconfiguration started (old inc 100, new inc 102)
List of instances:
 2 (myinst: 2)
 Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
* dead instance detected - domain 2 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Tue Sep 29 14:42:22 2009
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Tue Sep 29 14:42:22 2009
kfdp_query(DATA2): 8
Tue Sep 29 14:42:22 2009
NOTE: PST enabling heartbeating (grp 2)
NOTE: PST enabling heartbeating (grp 1)
kfdp_queryBg(): 8
NOTE: SMON starting instance recovery for group DATA2 domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: F1X0 found on disk 1 au 2 fcn 0.0
NOTE: starting recovery of thread=2 ckpt=17.324 group=1 (DATA2)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 1 (DATA2)
NOTE: advancing ckpt for thread=2 ckpt=17.324
NOTE: SMON did instance recovery for group DATA2 domain 1
Tue Sep 29 14:42:22 2009
NOTE: waiting for instance recovery of group 2
NOTE: waiting for instance recovery of group 2
kfdp_query(DATA): 9
kfdp_queryBg(): 9
NOTE: SMON starting instance recovery for group DATA domain 2 (mounted)
NOTE: F1X0 found on disk 1 au 2 fcn 0.831
NOTE: F1X0 found on disk 2 au 240 fcn 0.814
NOTE: F1X0 found on disk 3 au 2 fcn 0.814
NOTE: starting recovery of thread=2 ckpt=18.342 group=2 (DATA)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 2 (DATA)
NOTE: advancing ckpt for thread=2 ckpt=18.342
NOTE: SMON did instance recovery for group DATA domain 2

ASM recovers both disk groups as well.
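The relevant lines are easy to miss in the noise, so a small sketch that greps the alert log for the per-diskgroup recovery confirmation can help (the sample text below is copied from the excerpt above; on a real host you would read the ASM alert log instead):

```python
import re

# Extract the "SMON did instance recovery" NOTE lines from an ASM alert log.
# LOG holds sample lines from the excerpt above; replace with the real file's content.
LOG = """\
NOTE: SMON starting instance recovery for group DATA2 domain 1 (mounted)
NOTE: SMON did instance recovery for group DATA2 domain 1
NOTE: SMON starting instance recovery for group DATA domain 2 (mounted)
NOTE: SMON did instance recovery for group DATA domain 2
"""

# Match the diskgroup name and domain number of each completed recovery.
done = re.findall(r"SMON did instance recovery for group (\S+) domain (\d+)", LOG)
for group, domain in done:
    print(f"diskgroup {group} (domain {domain}) recovered")
```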

ADVM messages on node “rac2”

[Oracle OKS] Cluster Membership change - Current  incarn 0x60
[Oracle OKS] Nodes in cluster:
[Oracle OKS]   Node 1 (IP 0xc0a8b50a) Node 2 (IP 0xc0a8b514)
[Oracle OKS] Node count 2
[Oracle OKS] Cluster Membership change - Current  incarn 0x61
[Oracle OKS] Nodes in cluster:
[Oracle OKS]   Node 2 (IP 0xc0a8b514)
[Oracle OKS] Node count 1
[Oracle ADVM] Cluster reconfiguration started.
[Oracle ADVM] Cluster reconfiguration completed.
[Oracle ADVM] Cluster reconfiguration completed.
[Oracle OKS] Cluster Membership change setup complete
[Oracle ADVM] Mirror recovery for volume DATA2-ORA11P_HOME started.
[Oracle ADVM] Mirror recovery for volume DATA2-ORA11P_HOME completed.

ADVM recovers the mirror for the ACFS volume as well.
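As a side note, the OKS membership lines print the node IPs as raw hex words. Decoding them (a small sketch, no Oracle tooling required) shows they are the cluster interconnect addresses on the 192.168.181.0 network that the GPnP profile configures later in this post:

```python
import ipaddress

def oks_ip(hexword):
    """Convert an OKS hex IP word like 0xc0a8b50a to dotted-quad notation."""
    return str(ipaddress.IPv4Address(int(hexword, 16)))

# The two node IPs printed by OKS in the log above.
print(oks_ip("0xc0a8b50a"))  # node 1 interconnect address
print(oks_ip("0xc0a8b514"))  # node 2 interconnect address
```

So node 1 maps to 192.168.181.10, which matches the interconnect address the ASM instance on "rac1" reports at startup.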

Re-Starting the node

Clusterware

Alertrac2.log

[cssd(3917)]CRS-1601:CSSD Reconfiguration complete. Active
nodes are rac1 rac2 .
2009-09-29 14:59:34.214
[crsd(4386)]CRS-2772:Server 'rac1' has been assigned to pool 'Generic'.
2009-09-29 14:59:34.214
[crsd(4386)]CRS-2772:Server 'rac1' has been assigned to pool 'ora.ora11p'.

Alertrac1.log

2009-09-29 14:58:03.638
[ohasd(3700)]CRS-2112:The OLR service started on node rac1.
2009-09-29 14:58:04.085
[ohasd(3700)]CRS-8011:reboot advisory message from host: rac1, component:
mo205532, with time stamp: L-2009-09-28-23:02:37.992
[ohasd(3700)]CRS-8013:reboot advisory message text: clsnomon_status: need
to reboot, unexpected failure 8 received from CSS
2009-09-29 14:58:04.605
[ohasd(3700)]CRS-8011:reboot advisory message from host: rac1, component:
ag205532, with time stamp: L-2009-09-28-23:02:37.991
[ohasd(3700)]CRS-8013:reboot advisory message text: clsnomon_status: need
to reboot, unexpected failure 8 received from CSS
2009-09-29 14:58:04.649
[ohasd(3700)]CRS-8017:location: /etc/oracle/lastgasp has 38 reboot advisory
log files, 6 were announced and 0 errors occurred
2009-09-29 14:58:12.866
[ohasd(3700)]CRS-2772:Server 'rac1' has been assigned to pool 'Free'.
2009-09-29 14:58:15.711
[ohasd(3700)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON
(GPNPD daemon is not running).
2009-09-29 14:58:18.554
[cssd(3933)]CRS-1713:CSSD daemon is started in clustered mode
2009-09-29 14:58:39.604
[cssd(3933)]CRS-1707:Lease acquisition for node rac1 number 1 completed
2009-09-29 14:58:39.838
[cssd(3933)]CRS-1605:CSSD voting file is online: ORCL:DISK002B; details
in /u01/app/11.2.0/grid/log/rac1/cssd/ocssd.log.
2009-09-29 14:58:39.887
[cssd(3933)]CRS-1605:CSSD voting file is online: ORCL:DISK002A; details
in /u01/app/11.2.0/grid/log/rac1/cssd/ocssd.log.
2009-09-29 14:58:39.963
[cssd(3933)]CRS-1605:CSSD voting file is online: ORCL:DISK001B; details
in /u01/app/11.2.0/grid/log/rac1/cssd/ocssd.log.
2009-09-29 14:58:50.910
[cssd(3933)]CRS-1601:CSSD Reconfiguration complete. Active nodes
are rac1 rac2 .
2009-09-29 14:58:51.915
[ctssd(4019)]CRS-2403:The Cluster Time Synchronization Service on
host rac1 is in observer mode.
2009-09-29 14:58:51.940
[ctssd(4019)]CRS-2407:The new Cluster Time Synchronization Service
reference node is host rac2.
2009-09-29 14:58:52.559
[ctssd(4019)]CRS-2401:The Cluster Time Synchronization Service started
on host rac1.
[client(4179)]CRS-10001:ACFS-9327: Verifying ADVM/ACFS devices.
[client(4209)]CRS-10001:ACFS-9322: done.
2009-09-29 14:59:26.797
[crsd(4378)]CRS-1012:The OCR service started on node rac1.
2009-09-29 14:59:30.094
[crsd(4378)]CRS-1201:CRSD started on node rac1.
[client(4676)]CRS-10001:ACFS-9257: Mounting
/dev/asm/ora11p_home-132
on /u01/app/oracle/product/11.2.0/ora11p.
[client(4732)]CRS-10001:ACFS-9257: Mounting /dev/asm/ora11p_home-132
on /u01/app/oracle/product/11.2.0/ora11p.
[client(4762)]CRS-10001:ACFS-9258: Failed to mount
/dev/asm/ora11p_home-132 on /u01/app/oracle/product/11.2.0/ora11p.
2009-09-29 14:59:44.096
[/u01/app/11.2.0/grid/bin/orarootagent.bin(4513)]CRS-5016:Process
"/u01/app/11.2.0/grid/bin/acfssinglefsmount" spawned
by agent "/u01/app/11.2.0/grid/bin/orarootagent.bin" for action "start"
failed: details at "(:CLSN00010:)" in
"/u01/app/11.2.0/grid/log/rac1/agent/crsd/orarootagent_root/orarootagent_root.log"

The clusterware reports problems mounting the ACFS file system on the restarted node. The log file “orarootagent_root.log” gives an explanation:

2009-09-29 14:59:44.096: [ora.data2.ora11p_home.acfs][1511520576] [start]
(:CLSN00010:)mount: according to mtab, /dev/asm/ora11p_home-132
is already mounted on /u01/app/oracle/product/11.2.0/ora11p

2009-09-29 14:59:44.096: [ora.data2.ora11p_home.acfs][1511520576] [start]
(:CLSN00010:)acfssinglefsmount: ACFS-9258: Failed to mount
/dev/asm/ora11p_home-132 on /u01/app/oracle/product/11.2.0/ora11p.

2009-09-29 14:59:44.096: [ora.data2.ora11p_home.acfs][1511520576]
[start] Exception type=2 string=ACFS-9257: Mounting /dev/asm/ora11p_home-132
on /u01/app/oracle/product/11.2.0/ora11p.
mount: /dev/asm/ora11p_home-132 already mounted or
/u01/app/oracle/product/11.2.0/ora11p busy

The explanation for this behavior: the file /etc/mtab contains a list of all file systems currently mounted on the host. Because we suddenly cut the power supply, the mtab file was never properly cleaned up. After the reboot, Clusterware tries to mount the volume "/dev/asm/ora11p_home-132" by issuing a mount command, just as you would do manually. mount itself checks /etc/mtab for already mounted file systems and reports that the file system is already mounted. Nevertheless, Oracle eventually succeeds in mounting the ACFS file system.
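Such stale entries can be spotted by comparing /etc/mtab against the kernel's own view in /proc/mounts. A minimal sketch with hypothetical sample content (on a real host you would read the two files instead of these strings):

```python
# Sample mtab-style content; the ACFS entry is stale after the hard power-off.
# These strings stand in for /etc/mtab and /proc/mounts on a real host.
MTAB = """\
/dev/sda1 /boot ext3 rw 0 0
/dev/asm/ora11p_home-132 /u01/app/oracle/product/11.2.0/ora11p acfs rw 0 0
"""

PROC_MOUNTS = """\
/dev/sda1 /boot ext3 rw 0 0
"""

def mounts(text):
    """Return the set of (device, mountpoint) pairs from mtab-style text."""
    return {tuple(line.split()[:2]) for line in text.splitlines() if line.strip()}

# Entries listed in mtab that the kernel does not consider mounted are stale.
stale = mounts(MTAB) - mounts(PROC_MOUNTS)
for dev, mnt in sorted(stale):
    print(f"stale mtab entry: {dev} on {mnt}")
```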

ASM messages

Node “rac2”

Tue Sep 29 14:59:13 2009
 Reconfiguration started (old inc 102, new inc 104)
 List of instances:
 1 2 (myinst: 2)
 Global Resource Directory frozen
 Communication channels reestablished
 Tue Sep 29 14:59:13 2009
 * domain 0 valid = 1 according to instance 1
 * domain 1 valid = 1 according to instance 1
 * domain 2 valid = 1 according to instance 1
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Tue Sep 29 14:59:13 2009
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
 Reconfiguration complete

Node “rac1”

Tue Sep 29 14:59:05 2009
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 bond1 192.168.181.0 configured from
GPnP Profile for use as a cluster interconnect
Interface type 1 bond0 172.23.0.0 configured from
GPnP Profile for use as a public interface
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value
as /u01/app/11.2.0/grid/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NOTE: Volume support  enabled
Starting up:
Oracle Database 11g Enterprise Edition Release
11.2.0.1.0 - 64bit Production
With the Real Application Clusters and Automatic
Storage Management options.
Using parameter settings in server-side spfile
+DATA/rac/asmparameterfile/registry.253.697417189
System parameters with non-default values:
 large_pool_size          = 12M
 instance_type            = "asm"
 remote_login_passwordfile= "EXCLUSIVE"
 asm_diskgroups           = "DATA2"
 asm_power_limit          = 1
 diagnostic_dest          = "/u01/app/oracle"
Cluster communication is configured to use the
following interface(s) for this instance
 192.168.181.10
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Tue Sep 29 14:59:08 2009
PMON started with pid=2, OS id=4294
Tue Sep 29 14:59:08 2009
VKTM started with pid=3, OS id=4296 at elevated priority
VKTM running at (10)millisec precision with DBRM quantum (100)ms
Tue Sep 29 14:59:08 2009
GEN0 started with pid=4, OS id=4300
Tue Sep 29 14:59:09 2009
DIAG started with pid=5, OS id=4302
Tue Sep 29 14:59:09 2009
PING started with pid=6, OS id=4304
Tue Sep 29 14:59:09 2009
PSP0 started with pid=7, OS id=4306
Tue Sep 29 14:59:09 2009
DIA0 started with pid=8, OS id=4308
Tue Sep 29 14:59:09 2009
LMON started with pid=9, OS id=4310
Tue Sep 29 14:59:11 2009
LMD0 started with pid=10, OS id=4312
Tue Sep 29 14:59:11 2009
LMS0 started with pid=11, OS id=4316 at elevated priority
Tue Sep 29 14:59:11 2009
LMHB started with pid=12, OS id=4320
Tue Sep 29 14:59:12 2009
MMAN started with pid=13, OS id=4322
Tue Sep 29 14:59:12 2009
DBW0 started with pid=14, OS id=4324
Tue Sep 29 14:59:12 2009
LGWR started with pid=15, OS id=4326
Tue Sep 29 14:59:12 2009
CKPT started with pid=16, OS id=4328
Tue Sep 29 14:59:12 2009
RBAL started with pid=18, OS id=4332
Tue Sep 29 14:59:12 2009
GMON started with pid=19, OS id=4334
Tue Sep 29 14:59:12 2009
SMON started with pid=17, OS id=4330
Tue Sep 29 14:59:12 2009
MMON started with pid=20, OS id=4336
Tue Sep 29 14:59:12 2009
MMNL started with pid=21, OS id=4338
lmon registered with NM - instance number 1 (internal mem no 0)
Reconfiguration started (old inc 0, new inc 104)
ASM instance
List of instances:
 1 2 (myinst: 1)
 Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
 Communication channels reestablished
* allocate domain 1, invalid = TRUE
* allocate domain 2, invalid = TRUE
 * domain 0 valid = 1 according to instance 2
 * domain 1 valid = 1 according to instance 2
 * domain 2 valid = 1 according to instance 2
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Tue Sep 29 14:59:14 2009
LCK0 started with pid=22, OS id=4344
Tue Sep 29 14:59:17 2009
SQL> ALTER DISKGROUP ALL MOUNT /* asm agent */
Diskgroup with spfile:DATA
Diskgroup used for OCR is:DATA
NOTE: cache registered group DATA2 number=1 incarn=0x454864be
NOTE: cache began mount (not first) of group DATA2 number=1 incarn=0x454864be
NOTE: cache registered group DATA number=2 incarn=0x458864bf
NOTE: cache began mount (not first) of group DATA number=2 incarn=0x458864bf
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Assigning number (1,0) to disk (ORCL:DISK003A)
NOTE: Assigning number (1,1) to disk (ORCL:DISK003B)
NOTE: Assigning number (2,0) to disk (ORCL:DISK001A)
NOTE: Assigning number (2,1) to disk (ORCL:DISK001B)
NOTE: Assigning number (2,2) to disk (ORCL:DISK002A)
NOTE: Assigning number (2,3) to disk (ORCL:DISK002B)
kfdp_query(DATA2): 3
kfdp_queryBg(): 3
NOTE: cache opening disk 0 of grp 1: DISK003A label:DISK003A
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DISK003B label:DISK003B
NOTE: F1X0 found on disk 1 au 2 fcn 0.0
NOTE: cache mounting (not first) normal redundancy group 1/0x454864BE (DATA2)
kjbdomatt send to inst 2
NOTE: attached to recovery domain 1
NOTE: LGWR attempting to mount thread 2 for diskgroup 1 (DATA2)
NOTE: LGWR found thread 2 closed at ABA 17.323
NOTE: LGWR mounted thread 2 for diskgroup 1 (DATA2)
NOTE: LGWR opening thread 2 at fcn 0.2638 ABA 18.324
NOTE: cache mounting group 1/0x454864BE (DATA2) succeeded
NOTE: cache ending mount (success) of group DATA2 number=1 incarn=0x454864be
kfdp_query(DATA): 4
kfdp_queryBg(): 4
kfdp_query(DATA): 5
kfdp_queryBg(): 5
NOTE: cache opening disk 1 of grp 2: DISK001B label:DISK001B
NOTE: F1X0 found on disk 1 au 2 fcn 0.831
NOTE: cache opening disk 2 of grp 2: DISK002A label:DISK002A
NOTE: F1X0 found on disk 2 au 240 fcn 0.814
NOTE: cache opening disk 3 of grp 2: DISK002B label:DISK002B
NOTE: F1X0 found on disk 3 au 2 fcn 0.814
NOTE: cache mounting (not first) normal redundancy group 2/0x458864BF (DATA)
kjbdomatt send to inst 2
NOTE: attached to recovery domain 2
NOTE: LGWR attempting to mount thread 2 for diskgroup 2 (DATA)
NOTE: LGWR found thread 2 closed at ABA 18.341
NOTE: LGWR mounted thread 2 for diskgroup 2 (DATA)
NOTE: LGWR opening thread 2 at fcn 0.4042 ABA 19.342
NOTE: cache mounting group 2/0x458864BF (DATA) succeeded
NOTE: cache ending mount (success) of group DATA number=2 incarn=0x458864bf
kfdp_query(DATA2): 6
kfdp_queryBg(): 6
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup DATA2 was mounted
NOTE: De-assigning number (2,0) from disk (ORCL:DISK001A)
kfdp_query(DATA): 7
kfdp_queryBg(): 7
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 2
SUCCESS: diskgroup DATA was mounted
SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent */
SQL> ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent */
Starting background process VBG0
Starting background process ACFS
Tue Sep 29 14:59:20 2009
ACFS started with pid=25, OS id=4359
Tue Sep 29 14:59:20 2009
VBG0 started with pid=24, OS id=4357
SUCCESS: ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent */
Tue Sep 29 14:59:25 2009
Starting background process ASMB
Tue Sep 29 14:59:25 2009
ASMB started with pid=26, OS id=4393
Tue Sep 29 14:59:40 2009
Starting background process VDBG
Tue Sep 29 14:59:40 2009
VDBG started with pid=32, OS id=4713
Starting background process VMB0
Tue Sep 29 14:59:40 2009
VMB0 started with pid=33, OS id=4718
Starting background process VBG1
Starting background process VBG2
Tue Sep 29 14:59:43 2009
VBG2 started with pid=35, OS id=4752
Tue Sep 29 14:59:43 2009
VBG1 started with pid=34, OS id=4750
Tue Sep 29 15:00:33 2009
ALTER SYSTEM SET local_listener='(DESCRIPTION=
(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=rac1-vip)(PORT=1521))))'
SCOPE=MEMORY SID='+ASM1';

ADVM messages

Node “rac2”

[Oracle OKS] Cluster Membership change - Current  incarn 0x64
 [Oracle OKS] Nodes in cluster:
 [Oracle OKS]   Node 1 (IP 0xc0a8b50a) Node 2 (IP 0xc0a8b514)
 [Oracle OKS] Node count 2
 [Oracle ADVM] Cluster reconfiguration started.
 [Oracle ADVM] Cluster reconfiguration completed.
 [Oracle ADVM] Cluster reconfiguration completed.
 [Oracle OKS] Cluster Membership change setup complete

Node “rac1”

[Oracle OKS] mallocing log buffer, size=10485760
[Oracle OKS] log buffer = 0xffffc20000006000
[Oracle OKS] Module load succeeded. Build information:
(LOW DEBUG)- (USM_11.2.0.1.0_LINUX.X64_090715.1) , Built on Aug  2 2009
oracleadvm: no version for "KsRwLockReadTry" found: kernel tainted.
[Oracle ADVM] Module load succeeded. Build information: 
(LOW DEBUG) - USM_11.2.0.1.0_LINUX.X64_090715.1  Built on Aug  2 2009
[Oracle ACFS] Module load succeeded. Build information: 
(LOW DEBUG), (USM_11.2.0.1.0_LINUX.X64_090715.1), Built on Aug  2 2009
[Oracle OKS] Cluster Membership change - Current  incarn 0x64
[Oracle OKS] Nodes in cluster:
[Oracle OKS]   Node 1 (IP 0xc0a8b50a) Node 2 (IP 0xc0a8b514)
[Oracle OKS] Node count 2

Conclusion

Suddenly turning off the power seems to pose no problem for the cluster. Even after more than twenty tests, the whole cluster came back up successfully every time. Quite good :-)
