ASM resilvering – or – how to recover your crashed cluster – Test no 3

Test #3: Overwriting the ASM disk header with ASM disk group being offline

In this post we dig a little deeper into ASM: we overwrite the ASM disk header on one disk while the disk group is offline, check the results and then repair the LUN.

Overwrite the disk header of disk “DISK003A”

dd if=/dev/random bs=8k count=1 of=/dev/sdg1
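
If you want a way back before running a destructive command like this, one option is to save the block first. The following is a minimal sketch that works on a scratch file, not on a real device. The temporary files stand in for /dev/sdg1 and for a backup location and are assumptions for illustration, as is the use of /dev/urandom (which, unlike /dev/random, does not block when entropy is low). Whether writing a saved block back is enough to repair a real ASM disk is not something this test covers.

```shell
# Scratch files only: "dev" stands in for /dev/sdg1 (assumption), "backup" is hypothetical.
dev=$(mktemp)
backup=$(mktemp)

# Fill the stand-in device with 16k of data so there is a first block to save.
dd if=/dev/urandom of="$dev" bs=8k count=2 2>/dev/null

# 1. Save the first 8k block and remember a checksum of the whole file.
dd if="$dev" of="$backup" bs=8k count=1 2>/dev/null
before=$(cksum < "$dev")

# 2. Overwrite the first block, as in the test above.
#    conv=notrunc keeps the rest of the scratch file intact.
dd if=/dev/urandom of="$dev" bs=8k count=1 conv=notrunc 2>/dev/null

# 3. Write the saved block back and compare checksums.
dd if="$backup" of="$dev" bs=8k count=1 conv=notrunc 2>/dev/null
after=$(cksum < "$dev")

[ "$before" = "$after" ] && echo "first block restored"
rm -f "$dev" "$backup"
```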

Try to mount the disk group

SQL> alter diskgroup DATA2 mount;
alter diskgroup data2 mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "0" is missing from group number "2"

The ASM alert.log shows:

NOTE: group DATA2: updated PST location: disk 0001 (PST copy 0)
NOTE: start heartbeating (grp 2)
kfdp_query(DATA2): 13
kfdp_queryBg(): 13
NOTE: group DATA2: updated PST location: disk 0001 (PST copy 0)
NOTE: Assigning number (2,0) to disk ()
kfdp_query(DATA2): 14
kfdp_queryBg(): 14
NOTE: group DATA2: updated PST location: disk 0001 (PST copy 0)
NOTE: cache dismounting (clean) group 2/0x635ADC92 (DATA2)
NOTE: dbwr not being msg'd to dismount
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 2/0x635ADC92 (DATA2)
NOTE: cache ending mount (fail) of group DATA2 number=2 incarn=0x635adc92
kfdp_dismount(): 15
kfdp_dismountBg(): 15
NOTE: De-assigning number (2,0) from disk ()
NOTE: De-assigning number (2,1) from disk (ORCL:DISK003B)
ERROR: diskgroup DATA2 was not mounted
NOTE: cache deleting context for group DATA2 2/1666899090
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "0" is missing from group number "2"
ERROR: alter diskgroup data2 mount
Thu Oct 01 14:12:04 2009
ASM Health Checker found 1 new failures
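
Most of this excerpt is NOTE-level output. A small filter along the following lines can pull out the entries that matter. This is only a sketch: alert_errors is a hypothetical helper name, and the sample input is copied from the log above.

```shell
# Print only ORA- errors and ERROR:/WARNING: lines from alert.log text on stdin.
alert_errors() {
    grep -E '^(ORA-[0-9]+:|ERROR:|WARNING:)'
}

# Sample input taken from the failed mount attempt.
alert_errors <<'EOF'
NOTE: cache dismounted group 2/0x635ADC92 (DATA2)
ERROR: diskgroup DATA2 was not mounted
NOTE: cache deleting context for group DATA2 2/1666899090
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "0" is missing from group number "2"
ERROR: alter diskgroup data2 mount
EOF
```

On a real system the function would read the ASM alert.log file instead of the inline sample.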

Depending on whether you rebooted the server or ran “oracleasm scandisks”, the output of the following query may look different:

SQL> set pages 40000 lines 120
SQL> col PATH for a30
SQL> select DISK_NUMBER,MOUNT_STATUS,HEADER_STATUS,MODE_STATUS,STATE,PATH FROM V$ASM_DISK;

DISK_NUMBER MOUNT_S HEADER_STATU MODE_ST STATE    PATH
----------- ------- ------------ ------- -------- ------------------------------
0 CLOSED  MEMBER       ONLINE  NORMAL   ORCL:DISK001A
3 CLOSED  MEMBER       ONLINE  NORMAL   ORCL:DISK003B
2 CLOSED  CANDIDATE    ONLINE  NORMAL   ORCL:DISK003A
1 CACHED  MEMBER       ONLINE  NORMAL   ORCL:DISK001B
2 CACHED  MEMBER       ONLINE  NORMAL   ORCL:DISK002A
3 CACHED  MEMBER       ONLINE  NORMAL   ORCL:DISK002B

Disk “DISK003A” is marked as “CANDIDATE” instead of “MEMBER”… ASM still remembers the disk as being there.

But if you run “oracleasm scandisks”, the ASM library re-scans all disks and re-reads the ASM headers. This time the query output looks like this:

SQL> set pages 40000 lines 120
SQL> col PATH for a30
SQL> select DISK_NUMBER,MOUNT_STATUS,HEADER_STATUS,MODE_STATUS,STATE,PATH FROM V$ASM_DISK;

DISK_NUMBER MOUNT_S HEADER_STATU MODE_ST STATE    PATH
----------- ------- ------------ ------- -------- ------------------------------
 0 CLOSED  MEMBER       ONLINE  NORMAL   ORCL:DISK001A
 2 CLOSED  MEMBER       ONLINE  NORMAL   ORCL:DISK003B
 1 CACHED  MEMBER       ONLINE  NORMAL   ORCL:DISK001B
 2 CACHED  MEMBER       ONLINE  NORMAL   ORCL:DISK002A
 3 CACHED  MEMBER       ONLINE  NORMAL   ORCL:DISK002B

The former disk “DISK003A” is gone. This is no surprise: we re-scanned all disks, and ASM removed DISK003A because it no longer has a valid disk header. You will see the same result after a system reboot.

Try to force disk group mount

Although one mirror is missing, we should be able to mount the disk group with the remaining mirror:

SQL> alter diskgroup data2 mount force;
Diskgroup altered.

The ASM alert.log shows:

SQL> alter diskgroup data2 mount force
NOTE: cache registered group DATA2 number=2 incarn=0x2b1adcb7
NOTE: cache began mount (first) of group DATA2 number=2 incarn=0x2b1adcb7
NOTE: Assigning number (2,1) to disk (ORCL:DISK003B)
Thu Oct 01 14:43:59 2009
NOTE: start heartbeating (grp 2)
Thu Oct 01 14:43:59 2009
kfdp_query(DATA2): 31
kfdp_queryBg(): 31
NOTE: Assigning number (2,0) to disk ()
kfdp_query(DATA2): 32
kfdp_queryBg(): 32
NOTE: cache opening disk 1 of grp 2: DISK003B label:DISK003B
NOTE: F1X0 found on disk 1 au 2 fcn 0.0
NOTE: cache mounting (first) normal redundancy group 2/0x2B1ADCB7 (DATA2)
Thu Oct 01 14:44:00 2009
* allocate domain 2, invalid = TRUE
Thu Oct 01 14:44:00 2009
NOTE: attached to recovery domain 2
NOTE: cache recovered group 2 to fcn 0.3769
Thu Oct 01 14:44:00 2009
NOTE: LGWR attempting to mount thread 1 for diskgroup 2 (DATA2)
NOTE: LGWR found thread 1 closed at ABA 32.395
NOTE: LGWR mounted thread 1 for diskgroup 2 (DATA2)
NOTE: LGWR opening thread 1 at fcn 0.3769 ABA 33.396
NOTE: cache mounting group 2/0x2B1ADCB7 (DATA2) succeeded
NOTE: cache ending mount (success) of group DATA2 number=2 incarn=0x2b1adcb7
Thu Oct 01 14:44:00 2009
kfdp_query(DATA2): 33
kfdp_queryBg(): 33
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 2
SUCCESS: diskgroup DATA2 was mounted
SUCCESS: alter diskgroup data2 mount force
Thu Oct 01 14:44:00 2009
NOTE: diskgroup resource ora.DATA2.dg is online
Thu Oct 01 14:45:38 2009
WARNING: Disk (DISK003A) will be dropped in: (12407) secs on ASM inst: (1)
GMON SlaveB: Deferred DG Ops completed.
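
The WARNING line near the end of this log gives the time left before ASM drops the offline disk, in seconds. In 11g this countdown is, as far as can be told from the documentation, governed by the disk group's disk_repair_time attribute. The sketch below simply extracts the number and converts it to a more readable form; drop_countdown and secs_to_hms are hypothetical helper names.

```shell
# Extract the countdown (in seconds) from a "will be dropped in" warning on stdin.
drop_countdown() {
    sed -n 's/.*will be dropped in: (\([0-9]*\)) secs.*/\1/p'
}

# Convert a number of seconds to hours, minutes and seconds.
secs_to_hms() {
    s=$1
    printf '%dh %dm %ds\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

line='WARNING: Disk (DISK003A) will be dropped in: (12407) secs on ASM inst: (1)'
secs_to_hms "$(printf '%s\n' "$line" | drop_countdown)"
# prints: 3h 26m 47s
```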

We query v$asm_disk again:

SQL> set pages 40000 lines 120
SQL> col PATH for a30
SQL> select DISK_NUMBER,MOUNT_STATUS,HEADER_STATUS,MODE_STATUS,STATE,PATH FROM V$ASM_DISK;

DISK_NUMBER MOUNT_S HEADER_STATU MODE_ST STATE    PATH
----------- ------- ------------ ------- -------- ------------------------------
 0 CLOSED  MEMBER       ONLINE  NORMAL   ORCL:DISK001A
 0 MISSING UNKNOWN      OFFLINE NORMAL
 1 CACHED  MEMBER       ONLINE  NORMAL   ORCL:DISK001B
 2 CACHED  MEMBER       ONLINE  NORMAL   ORCL:DISK002A
 3 CACHED  MEMBER       ONLINE  NORMAL   ORCL:DISK002B
 1 CACHED  MEMBER       ONLINE  NORMAL   ORCL:DISK003B

Identify the new/missing disk

ASM recognized that one disk is missing, but how do we identify it? I found the following script and modified it slightly for use with the current ASM library:

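# querydisk -d reports the device as [major, minor]; extract disk name, major and minor.
/etc/init.d/oracleasm querydisk -d `/etc/init.d/oracleasm listdisks -d` | \
cut -f2,10,11 -d" " | \
perl -pe 's/"(.*)".*\[(.*), *(.*)\]/$1 $2 $3/g;' | \
while read v_asmdisk v_major v_minor
do
    # Find the /dev entry with the same major/minor pair (column 10 of "ls -la" is the name).
    v_device=`ls -la /dev | grep " $v_major, *$v_minor " | awk '{print $10}'`
    echo "ASM disk $v_asmdisk based on /dev/$v_device [$v_major, $v_minor]"
done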

It will report all recognized ASM disks and their device paths:

ASM disk DISK001A based on /dev/sdc1 [8, 33]
ASM disk DISK001B based on /dev/sdb1 [8, 17]
ASM disk DISK002A based on /dev/sdd1 [8, 49]
ASM disk DISK002B based on /dev/sde1 [8, 65]
ASM disk DISK003B based on /dev/sdf1 [8, 81]

Based on this information we can subtract the ASM devices from the available devices and conclude that the missing device is /dev/sdg1. This approach may not be efficient, but it produces an overview quickly. It is then up to the administrator to verify which device to use. Remember that with multipathing a device might be seen several times.
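
The subtraction itself can be sketched in a few lines of shell. The list of available devices below is an assumption for illustration (on a real system it would come from something like the partition list), the ASM device list is taken from the script output above, and missing_devices is a hypothetical helper name.

```shell
# Print every device from the first list that does not appear in the second.
# $1: all candidate devices, one per line
# $2: devices already backing an ASM disk, one per line
missing_devices() {
    # -F fixed strings, -x whole-line match, -v invert; the newline-separated
    # second list is treated by grep as a list of patterns.
    printf '%s\n' "$1" | grep -vxF -e "$2"
}

# Assumed list of partitions intended for ASM on this host.
all_devs='/dev/sdb1
/dev/sdc1
/dev/sdd1
/dev/sde1
/dev/sdf1
/dev/sdg1'

# Devices reported by the script above.
asm_devs='/dev/sdc1
/dev/sdb1
/dev/sdd1
/dev/sde1
/dev/sdf1'

missing_devices "$all_devs" "$asm_devs"
# prints: /dev/sdg1
```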

Reuse the same ASM disk name

Label the ASM disk

We will re-use /dev/sdg1 and label it as ASM disk with its old label “DISK003A”:

[root@rac1 ~]# oracleasm createdisk DISK003A /dev/sdg1
Writing disk header: done
Instantiating disk: done

[root@rac1 ~]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...

[root@rac1 ~]# oracleasm listdisks
DISK001A
DISK001B
DISK002A
DISK002B
DISK003A
DISK003B

So far – no problem.

Add the disk to the disk group

Now we need to add the disk to the disk group:

SQL> alter diskgroup data2 add disk 'ORCL:DISK003A';
alter diskgroup data2 add disk 'ORCL:DISK003A'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15020: discovered duplicate ASM disk "DISK003A"

Well, this error message is not wrong. ASM remembers former disks and prevents a disk from being re-used under the same name. You can, however, force the re-use of the name in two steps:

  1. Drop the offline disk from the disk group
  2. Add the new disk to the disk group

1. Drop the offline disk from the disk group

SQL> alter diskgroup data2 drop disk DISK003A force;
Diskgroup altered.

ASM alert.log showing:

Thu Oct 01 15:24:59 2009
SQL> alter diskgroup data2 drop disk DISK003A force
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=2
NOTE: initiating PST update: grp = 2
kfdp_update(): 34
Thu Oct 01 15:25:02 2009
kfdp_updateBg(): 34
NOTE: group DATA2: updated PST location: disk 0001 (PST copy 0)
NOTE: PST update grp = 2 completed successfully
Thu Oct 01 15:25:02 2009
NOTE: membership refresh pending for group 2/0x2b1adcb7 (DATA2)
kfdp_query(DATA2): 35
kfdp_queryBg(): 35
SUCCESS: refreshed membership for 2/0x2b1adcb7 (DATA2)
NOTE: starting rebalance of group 2/0x2b1adcb7 (DATA2) at power 1
SUCCESS: alter diskgroup data2 drop disk DISK003A force
Starting background process ARB0
Thu Oct 01 15:25:05 2009
ARB0 started with pid=27, OS id=24150
NOTE: assigning ARB0 to group 2/0x2b1adcb7 (DATA2)
NOTE: F1X0 copy 1 relocating from 0:2 to 1:2 for diskgroup 2 (DATA2)
NOTE: F1X0 copy 2 relocating from 1:2 to 0:2 for diskgroup 2 (DATA2)
NOTE: stopping process ARB0
Thu Oct 01 15:25:12 2009
SUCCESS: rebalance completed for group 2/0x2b1adcb7 (DATA2)
Thu Oct 01 15:25:12 2009
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=2
NOTE: initiating PST update: grp = 2
kfdp_update(): 36
Thu Oct 01 15:25:15 2009
kfdp_updateBg(): 36
NOTE: group DATA2: updated PST location: disk 0001 (PST copy 0)
NOTE: PST update grp = 2 completed successfully
WARNING: offline disk number 0 has references (3433 AUs)
NOTE: initiating PST update: grp = 2
kfdp_update(): 37
kfdp_updateBg(): 37
NOTE: group DATA2: updated PST location: disk 0001 (PST copy 0)
NOTE: PST update grp = 2 completed successfully
NOTE: membership refresh pending for group 2/0x2b1adcb7 (DATA2)
kfdp_query(DATA2): 38
kfdp_queryBg(): 38
SUCCESS: refreshed membership for 2/0x2b1adcb7 (DATA2)

2. Add the new disk to the disk group

SQL> alter diskgroup data2 add disk 'ORCL:DISK003A';
Diskgroup altered.

With the ASM alert.log showing:

Thu Oct 01 15:27:22 2009
SQL> alter diskgroup data2 add disk 'ORCL:DISK003A'
NOTE: Assigning number (2,2) to disk (ORCL:DISK003A)
NOTE: requesting all-instance membership refresh for group=2
NOTE: initializing header on grp 2 disk DISK003A
NOTE: cache opening disk 2 of grp 2: DISK003A label:DISK003A
NOTE: requesting all-instance disk validation for group=2
Thu Oct 01 15:27:25 2009
NOTE: disk validation pending for group 2/0x2b1adcb7 (DATA2)
SUCCESS: validated disks for 2/0x2b1adcb7 (DATA2)
NOTE: initiating PST update: grp = 2
kfdp_update(): 39
Thu Oct 01 15:27:28 2009
kfdp_updateBg(): 39
NOTE: group DATA2: updated PST location: disk 0001 (PST copy 0)
NOTE: group DATA2: updated PST location: disk 0002 (PST copy 1)
NOTE: PST update grp = 2 completed successfully
NOTE: membership refresh pending for group 2/0x2b1adcb7 (DATA2)
kfdp_query(DATA2): 40
kfdp_queryBg(): 40
kfdp_query(DATA2): 41
kfdp_queryBg(): 41
SUCCESS: refreshed membership for 2/0x2b1adcb7 (DATA2)
NOTE: starting rebalance of group 2/0x2b1adcb7 (DATA2) at power 1
Starting background process ARB0
SUCCESS: alter diskgroup data2 add disk 'ORCL:DISK003A'
Thu Oct 01 15:27:31 2009
ARB0 started with pid=35, OS id=24288
NOTE: assigning ARB0 to group 2/0x2b1adcb7 (DATA2)
NOTE: F1X0 copy 2 relocating from 0:2 to 2:2 for diskgroup 2 (DATA2)

After that, ASM restores the chosen redundancy by rebalancing the extents. Once the operation has completed successfully, the following messages appear:

Thu Oct 01 15:37:45 2009
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 2/0x2b1adcb7 (DATA2)
Thu Oct 01 15:37:47 2009
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=2
NOTE: initiating PST update: grp = 2
kfdp_update(): 42
Thu Oct 01 15:37:50 2009
kfdp_updateBg(): 42
NOTE: group DATA2: updated PST location: disk 0001 (PST copy 0)
NOTE: group DATA2: updated PST location: disk 0002 (PST copy 1)
NOTE: PST update grp = 2 completed successfully
SUCCESS: disk number 0 force dropped offline
NOTE: initiating PST update: grp = 2
kfdp_update(): 43
kfdp_updateBg(): 43
NOTE: group DATA2: updated PST location: disk 0001 (PST copy 0)
NOTE: group DATA2: updated PST location: disk 0002 (PST copy 1)
NOTE: PST update grp = 2 completed successfully
NOTE: De-assigning number (2,0) from disk ()
NOTE: membership refresh pending for group 2/0x2b1adcb7 (DATA2)
kfdp_query(DATA2): 44
kfdp_queryBg(): 44
SUCCESS: refreshed membership for 2/0x2b1adcb7 (DATA2)

Check ASM disk operation

You can check ASM operations with the following query:

SQL> select GROUP_NUMBER, OPERATION, STATE, ACTUAL, SOFAR, EST_MINUTES from v$asm_operation;

If an operation (such as a rebalance) is in progress, the query returns one or more rows. For our just-added disk, for instance, we might get:

GROUP_NUMBER OPERA STAT     ACTUAL      SOFAR EST_MINUTES
------------ ----- ---- ---------- ---------- -----------
 2           REBAL RUN           1         49          16
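
If you want to wait for the rebalance from a script, one approach is to save the query output and check it for REBAL rows. The sketch below only parses text in the layout shown above. How the output gets there (for example via sqlplus) is left out, and rebalance_status is a hypothetical helper name.

```shell
# Read v$asm_operation output (columns as queried above) on stdin.
# Prints one line per running rebalance and returns non-zero if there is none.
rebalance_status() {
    awk '$2 == "REBAL" {
        printf "group %s: %s, about %s min left\n", $1, $3, $6
        n++
    }
    END { exit n ? 0 : 1 }'
}

rebalance_status <<'EOF'
GROUP_NUMBER OPERA STAT     ACTUAL      SOFAR EST_MINUTES
------------ ----- ---- ---------- ---------- -----------
 2           REBAL RUN           1         49          16
EOF
# prints: group 2: RUN, about 16 min left
```

A polling loop could call this function on fresh query output every minute or so and stop once it returns a non-zero status.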

8 thoughts on “ASM resilvering – or – how to recover your crashed cluster – Test no 3”

  1. Hi,

    Good observation,

    I was facing the same problem and was able to solve it with your article.

    Keep up the good work.

  2. I have a configuration where full device-path-names were used to create ASM diskgroups (Oracle DB on Solaris 10).
    Recently the LUN IDs were changed on the SAN storage and so the disk-device-names changed (on the Solaris 10 host) and so the ASM could not find/mount the disks for the diskgroups.
    I have tried setting the asm_diskstring parameter to a wildcard but that didn’t work (as I suspected). Assuming asm_diskstring is set, the question is: does ASM actually search for the member disks of a diskgroup that was created using full disk-device-paths, or does ASM only look for those devices exactly as they were specified? The rename_dg in 11g has an asm_diskstring parameter that can be set at the diskgroup level and which causes a rediscovery of the disks in a diskgroup, but unfortunately I am on 10gR2. Is there a way to do something similar via ASM or some other Oracle utility? Thanks.

    SQL> select adg.name dg_name, ad.name fg_name, path from v$asm_disk ad
    right outer join v$ASM_DISKGROUP adg
    on ad.group_number=adg.group_number;

    DG_NAME  FG_NAME     PATH
    -------  ----------  ----------------------------------
    DATA1    DATA1_0001  /dev/rdsk/c2t50XXXXXXXXX94FC2d2s0
    DATA1    DATA1_0002  /dev/rdsk/c2t50XXXXXXXXX94FC2d3s0
    DATA1    DATA1_0000  /dev/rdsk/c2t50XXXXXXXXX94FC2d1s0
    DATA1    DATA1_0003  /dev/rdsk/c2t50XXXXXXXXX94DECd11s0
    DATA     DATA_0000   /dev/rdsk/c2t50XXXXXXXXX94DECd0s0
    DATA     DATA_0001   /dev/rdsk/c2t50XXXXXXXXX94DECd1s0
    FRAD     FRADISK5    /dev/rdsk/c2t50XXXXXXXXX94DECd10s0
    FRAD     FRAD_0005   /dev/rdsk/c2t50XXXXXXXXX94DECd12s0
    FRAD     FRADISK6    /dev/rdsk/c2t50XXXXXXXXX94FC2d0s0
    FRAD     FRADISK3    /dev/rdsk/c2t50XXXXXXXXX94DECd9s0
    FRAD     FRADISK1    /dev/rdsk/c2t50XXXXXXXXX94DECd2s0
    FRAD     FRSDISK2    /dev/rdsk/c2t50XXXXXXXXX94DECd3s0

    12 rows selected.

    SQL>

    The LUN ID is roughly the 1 or 2 digit number after the small “d” in the device (PATH) names. Some of them changed at the O/S level when LUN ID was changed on the SAN storage.

    I was thinking maybe if I run the following sample command (let’s say the device-name for FRDISK6 changed for example):

    /etc/init.d/oracleasm force-renamedisk /dev/rdsk/c2t50XXXXXXXXXXXFC2d5s0 FRDISK6

    (where /dev/rdsk/c2t50XXXXXXXXXXXFC2d5s0 is the new O/S device for the disk)

    Do you think this would update ASM as required (including whatever data structures/tables/views) so that the diskgroup(s) become mountable?

    1. > the question is does ASM actually search for the member disks in a diskgroup that was created using full disk-device-paths or does ASM just only looks for those devices exactly as they are specified?

      ASM is not tied to the device path. If device names change, ASM recognizes this and is still able to mount the disk group.

      You already tried setting the asm_diskstring parameter. To what value exactly?
      Did you check the ASM permissions?
      Regarding the example you showed with force-renamedisk: do NOT do this, it won't change anything.
