Problem Statement
After scheduled maintenance on the Exadata cell nodes (battery replacement), the cluster services and ASM instances started successfully.
However, the database instance was unable to start because the DATA_DEMO disk group was not mounted.
Error from the ASM alert log:
alter diskgroup DATA_DEMO mount
NOTE: cache registered group DATA_DEMO number=1 incarn=0x966d8269
NOTE: cache began mount (first) of group DATA_DEMO number=1 incarn=0x966d8269
NOTE: Assigning number (1,35) to disk (o/192.xxx.xxx.xxx/DATA_DEMO_CD_11_demoss03)
NOTE: Assigning number (1,38) to disk (o/192.xxx.xxx.xxx/DATA_DEMO_CD_01_demoss03)
.
.
.
ERROR: diskgroup DATA_DEMO was not mounted
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "31" is missing from group number "1"
ERROR: alter diskgroup DATA_DEMO mount
Diagnosis
Activities carried out during maintenance
- DB and cluster services brought down (complete downtime taken).
- Cell node #1 brought down to replace the battery.
- Battery replaced and cell node restarted.
- The same steps performed in a rolling manner for cell nodes #2 and #3.
During cluster startup, the CRS services and ASM instance came up, but the database failed to start.
On examination, we found that the DATA_DEMO disk group was not mounted and had failed with the error below.
ERROR: diskgroup DATA_DEMO was not mounted
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "31" is missing from group number "1"
ERROR: alter diskgroup DATA_DEMO mount
We tried mounting the disk group manually, but it failed with the same error.
SQL> alter diskgroup DATA_DEMO mount;
ERROR at line 1: ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "31" is missing from group number "1"
At the ASM level, we observed disks with MOUNT_STATUS of MISSING/CLOSED and HEADER_STATUS of UNKNOWN/MEMBER for disk groups 3 and 0.
GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU MODE_ST NAME FAILGROUP PATH
------------ ----------- ------- ------------ ------- ------------------------- ----------- ---------------------------------------
3 31 MISSING UNKNOWN OFFLINE RECO_DEMO_CD_10_DEMO_SS03 DEMOSS03
0 31 CLOSED MEMBER ONLINE DEMOSS01 o/192.xxx.xxx.xxx/DATA_DEMO_CD_05_demoss01
A MISSING status indicates that the disk is known to be part of the ASM disk group, but no disk with the indicated name was found in the storage system.
The GROUP_NUMBER and DISK_NUMBER columns will only be valid if the disk is part of a disk group which is currently mounted by the instance.
Otherwise, GROUP_NUMBER will be 0, and DISK_NUMBER will be a unique value with respect to the other disks that also have a group number of 0.
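These two states end up driving different repair actions later in this post: a disk that is MISSING from a mounted group must be force-dropped from that group, while a CLOSED disk with a CANDIDATE header in group 0 can simply be added. A minimal sketch of that decision logic (the helper and the command templates are ours, purely illustrative, not any Oracle tooling):

```python
def repair_action(row):
    """Suggest a repair step for one V$ASM_DISK row, given as a dict
    with MOUNT_STATUS, HEADER_STATUS, GROUP_NUMBER, NAME and PATH."""
    if row["MOUNT_STATUS"] == "MISSING":
        # Still registered in the mounted group's metadata, but the
        # underlying grid disk cannot be found: force-drop it.
        return f"alter diskgroup ... drop disk {row['NAME']} force"
    if row["GROUP_NUMBER"] == 0 and row["HEADER_STATUS"] == "CANDIDATE":
        # Freshly created grid disk, not yet in any group: add it back.
        return f"alter diskgroup ... add disk '{row['PATH']}'"
    return None  # healthy, or to be handled case by case
```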
Since disks were MISSING, we started troubleshooting on the cell nodes to check the grid disk status.
On one of the cell nodes (#3), we found a DUPLICATE cell disk, and the grid disks' asmmodestatus was UNKNOWN/UNUSED.
CellCLI> list celldisk
CD_10_demoss03_duplicate_name importForceRequired
CellCLI> list griddisk attributes name, asmmodestatus,asmdeactivationoutcome;
DATA_DEMO_CD_00_demoss03 UNKNOWN Yes
DATA_DEMO_CD_01_demoss03 UNKNOWN Yes
DATA_DEMO_CD_02_demoss03 UNKNOWN Yes
DATA_DEMO_CD_03_demoss03 UNKNOWN Yes
DATA_DEMO_CD_04_demoss03 UNKNOWN Yes
DATA_DEMO_CD_05_demoss03 UNKNOWN Yes
DATA_DEMO_CD_06_demoss03 UNKNOWN Yes
DATA_DEMO_CD_07_demoss03 UNKNOWN Yes
DATA_DEMO_CD_08_demoss03 UNKNOWN Yes
DATA_DEMO_CD_09_demoss03 UNKNOWN Yes
DATA_DEMO_CD_10_demoss03 UNKNOWN Yes
DATA_DEMO_CD_10_demoss03_duplicate_name UNUSED Yes
DATA_DEMO_CD_11_demoss03 UNKNOWN Yes
DBFS_DG_CD_02_demoss03 UNKNOWN Yes
DBFS_DG_CD_03_demoss03 UNKNOWN Yes
DBFS_DG_CD_04_demoss03 UNKNOWN Yes
DBFS_DG_CD_05_demoss03 UNKNOWN Yes
DBFS_DG_CD_06_demoss03 UNKNOWN Yes
DBFS_DG_CD_07_demoss03 UNKNOWN Yes
DBFS_DG_CD_08_demoss03 UNKNOWN Yes
DBFS_DG_CD_09_demoss03 UNKNOWN Yes
DBFS_DG_CD_10_demoss03 UNUSED Yes
DBFS_DG_CD_11_demoss03 UNKNOWN Yes
RECO_DEMO_CD_00_demoss03 UNKNOWN Yes
RECO_DEMO_CD_01_demoss03 UNKNOWN Yes
RECO_DEMO_CD_02_demoss03 UNKNOWN Yes
RECO_DEMO_CD_03_demoss03 UNKNOWN Yes
RECO_DEMO_CD_04_demoss03 UNKNOWN Yes
RECO_DEMO_CD_05_demoss03 UNKNOWN Yes
RECO_DEMO_CD_06_demoss03 UNKNOWN Yes
RECO_DEMO_CD_07_demoss03 UNKNOWN Yes
RECO_DEMO_CD_08_demoss03 UNKNOWN Yes
RECO_DEMO_CD_09_demoss03 UNKNOWN Yes
RECO_DEMO_CD_10_demoss03 UNKNOWN Yes
RECO_DEMO_CD_10_demoss03_duplicate_name UNUSED Yes
RECO_DEMO_CD_11_demoss03 UNKNOWN Yes
The other cell nodes (#1, #2) showed ONLINE status for all grid disks:
CellCLI> list griddisk attributes name, asmmodestatus,asmdeactivationoutcome;
DATA_DEMO_CD_00_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DATA_DEMO_CD_01_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DATA_DEMO_CD_02_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DATA_DEMO_CD_03_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DATA_DEMO_CD_04_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DATA_DEMO_CD_05_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DATA_DEMO_CD_06_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DATA_DEMO_CD_07_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DATA_DEMO_CD_08_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DATA_DEMO_CD_09_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DATA_DEMO_CD_10_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DATA_DEMO_CD_11_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
DBFS_DG_CD_02_demoss01 ONLINE Yes
DBFS_DG_CD_03_demoss01 ONLINE Yes
DBFS_DG_CD_04_demoss01 ONLINE Yes
DBFS_DG_CD_05_demoss01 ONLINE Yes
DBFS_DG_CD_06_demoss01 ONLINE Yes
DBFS_DG_CD_07_demoss01 ONLINE Yes
DBFS_DG_CD_08_demoss01 ONLINE Yes
DBFS_DG_CD_09_demoss01 ONLINE Yes
DBFS_DG_CD_10_demoss01 ONLINE Yes
DBFS_DG_CD_11_demoss01 ONLINE Yes
RECO_DEMO_CD_00_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
RECO_DEMO_CD_01_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
RECO_DEMO_CD_02_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
RECO_DEMO_CD_03_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
RECO_DEMO_CD_04_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
RECO_DEMO_CD_05_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
RECO_DEMO_CD_06_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
RECO_DEMO_CD_07_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
RECO_DEMO_CD_08_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
RECO_DEMO_CD_09_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
RECO_DEMO_CD_10_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
RECO_DEMO_CD_11_demoss01 ONLINE "Cannot deactivate due to other offline disks in the diskgroup"
To troubleshoot the duplicate cell disk issue, we referred to the following MOS note:
After disk replacement cell disk name is appearing like “%duplicate%” and list griddisk status is showing “importRequired” (Doc ID 1485882.1)
We tried a forced import of the affected cell disk, but it did not help.
CellCLI> import celldisk CD_10_demoss03 force
CELL-04560: Cannot complete import of cell disk CD_10_demoss03.
Received Error: CELL-02525: Unknown cell disk: CD_10_demoss03
Cell disks not imported: Celldisk: CD_10_demoss03
As noted in the document, dropping the cell disk was a critical call, so before doing it we tried a couple of other workarounds:
- Restart of CELL services
This was attempted but did not resolve the issue.
ALTER CELL RESTART SERVICES ALL
- CELLNODE reboot
This did not help either; the cell disk came back with the same %duplicate_name% suffix.
- DETACH/ATTACH the PHYSICAL DISK
With the help of a field engineer, the affected physical disk was detached.
CellCLI> ALTER PHYSICALDISK 20:10 DROP FOR REPLACEMENT
The above command performs checks to determine whether it is safe to remove the disk, then prepares it for replacement. On successful completion, the blue LED on the disk is turned on.
Note that the command does not drop the grid disks from ASM; it only takes them offline, so running it does not trigger a rebalance. The advantage of this approach is that the disk can be replaced as soon as the command completes, and only a single rebalance occurs: the one that takes place after the new disk has been inserted.
The impacted disk was detached, re-attached, and then re-enabled.
CellCLI> ALTER PHYSICALDISK 20:10 REENABLE
(Note: Doc ID Referred – Things to Check in ASM When Replacing an ONLINE disk from Exadata Storage Cell (Doc ID 1326611.1))
Even after the above exercise, the same issue surfaced again.
Given the downtime crunch, the remaining option short of patching, re-creating the cell disk along with its grid disks, was exercised.
So, what led to this?
The following bug matched our scenario:
Bug 20376560: ORA-15040, ORA-15066 AND ORA-15042, ASM NOT ABLE TO MOUNT DISKGROUP
When the cell started, it was unable to locate the cell disk CD_10_demoss03, probably due to a misconfiguration or a missing entry for CD_10_demoss03 in the file $OSSCONF/cell_disk_config.xml.
Hence the cell disk was renamed with the _duplicate_name suffix:
[ossmgmt] [WARNING] [] [ms.core.MSCellDisk] [tid: 9] [ecid: 10.xxx.xxx.xxx:60766:1550947351798:0,0] getAttribute: Failed to get cell disk CD_10_demoss03's status null, use CDH status
[ossmgmt] [NOTIFICATION] [] [ms.core.MSCellDisk] [tid: 9] [ecid: 10.xxx.xxx.xxx:60766:1550947351798:0,0] getNewCDHState() cannot find the parent physical disk of cell diskCD_10_demoss03
[ossmgmt] [NOTIFICATION] [] [ms.core.MSCoreImpl] [tid: 9] [ecid: 10.xxx.xxx.xxx:60766:1550947351798:0,0] Renamed new CD CD_10_demoss03 to new name CD_10_demoss03_duplicate_name for this MS session.
[ossmgmt] [WARNING] [] [ms.core.MSCellDisk] [tid: 9] [ecid: 10.xxx.xxx.xxx:60766:1550947351798:0,0] getAttribute: Failed to get cell disk CD_10_demoss03_duplicate_name's status null, use CDH status
[ossmgmt] [WARNING] [] [ms.core.MSNetDisk] [tid: 9] [ecid: 10.xxx.xxx.xxx:60766:1550947351798:0,0] getAttribute: Failed to get cell disk DBFS_DG_CD_10_demoss03's status null, use CDH status
[ossmgmt] [WARNING] [] [ms.core.MSNetDisk] [tid: 9] [ecid: 10.xxx.xxx.xxx:60766:1550947351798:0,0] getAttribute: Failed to get cell disk RECO_DEMO_CD_10_demoss03_duplicate_name's status null, use CDH status
[ossmgmt] [WARNING] [] [ms.core.MSNetDisk] [tid: 9] [ecid: 10.xxx.xxx.xxx:60766:1550947351798:0,0] getAttribute: Failed to get cell disk DATA_DEMO_CD_10_demoss03_duplicate_name's status null, use CDH status
Solution Implemented
We recreated the cell disk and its grid disks with the following steps.
Confirm current Physical Disk status
CellCLI> list physicaldisk
20:10 KHH75L normal
CellCLI> list lun
0_10 0_10 normal
CellCLI> list celldisk
CD_10_demoss03_duplicate_name importForceRequired
Query the grid disks on the affected cell disk to capture the information needed to recreate them (name, size, offset)
CellCLI> list griddisk where celldisk=CD_10_demoss03 attributes name, size, offset
DATA_DEMO_CD_10_demoss03 256G 32M
DATA_DEMO_CD_10_demoss03_duplicate_name 256G 32M
RECO_DEMO_CD_10_demoss03 272.6875G 256.046875G
RECO_DEMO_CD_10_demoss03_duplicate_name 272.6875G 256.046875G
DBFS_DG_CD_10_demoss03 29.125G 528.734375G
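Before dropping anything, it is worth sanity-checking that the captured sizes and offsets account for the whole cell disk: each grid disk should begin at, or just after, the end of the previous one. A quick back-of-the-envelope check in Python (the mib() unit helper is ours):

```python
def mib(s):
    """Convert a CellCLI size string like '272.6875G' or '32M' to MiB."""
    return float(s[:-1]) * (1024 if s.endswith("G") else 1)

# (name, size, offset) exactly as captured from CellCLI above
layout = [
    ("DATA_DEMO_CD_10_demoss03", "256G",      "32M"),
    ("RECO_DEMO_CD_10_demoss03", "272.6875G", "256.046875G"),
    ("DBFS_DG_CD_10_demoss03",   "29.125G",   "528.734375G"),
]
# Gap between the end of each grid disk and the start of the next
gaps = [mib(n_off) - (mib(off) + mib(size))
        for (_, size, off), (_, _, n_off) in zip(layout, layout[1:])]
print(gaps)  # → [16.0, 0.0]: small non-negative gaps, no overlaps
```

The 16 MiB gap after the DATA grid disk is internal reserved space; the important point is that nothing overlaps and the RECO and DBFS extents tile the rest of the cell disk exactly.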
Drop affected CELLDISK
CellCLI> drop celldisk CD_10_demoss03 force;
CellDisk CD_10_demoss03 successfully dropped
CellCLI> drop celldisk CD_10_demoss03_duplicate_name force
CellDisk CD_10_demoss03_duplicate_name successfully dropped
Create CELLDISK
CellCLI> create celldisk CD_10_demoss03 lun=0_10
CellDisk CD_10_demoss03 successfully created
Create the new grid disks in ascending OFFSET order.
(Note: you must follow the offset order: the 32M offset first, then the ~256G offset, and finally the ~528G offset.)
CellCLI> create griddisk DATA_DEMO_CD_10_demoss03 celldisk=CD_10_demoss03,size=256G
GridDisk DATA_DEMO_CD_10_demoss03 successfully created
CellCLI> create griddisk RECO_DEMO_CD_10_demoss03 celldisk=CD_10_demoss03,size=272.6875G
GridDisk RECO_DEMO_CD_10_demoss03 successfully created
CellCLI> create griddisk DBFS_DG_CD_10_demoss03 celldisk=CD_10_demoss03,size=29.125G
GridDisk DBFS_DG_CD_10_demoss03 successfully created
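The creation order above can be derived mechanically from the attributes captured before the drop: sort by offset and emit one create command per grid disk. A minimal sketch (the tuples are the captured values with offsets converted to MiB; the command text is illustrative):

```python
captured = [  # (name, size, offset in MiB) saved before the cell disk was dropped
    ("RECO_DEMO_CD_10_demoss03", "272.6875G", 262192.0),
    ("DBFS_DG_CD_10_demoss03",   "29.125G",   541424.0),
    ("DATA_DEMO_CD_10_demoss03", "256G",      32.0),
]
# Sorting by offset reproduces the required creation order
cmds = [f"create griddisk {name} celldisk=CD_10_demoss03,size={size}"
        for name, size, _ in sorted(captured, key=lambda g: g[2])]
print("\n".join(cmds))
```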
Check New CELLDISK/GRIDDISK status
CellCLI> list celldisk
CD_10_demoss03 normal
CellCLI> list griddisk attributes name, asmmodestatus,asmdeactivationoutcome;
DATA_DEMO_CD_10_demoss03 UNUSED Yes
DBFS_DG_CD_10_demoss03 UNUSED Yes
RECO_DEMO_CD_10_demoss03 UNUSED Yes
CellCLI> list griddisk where celldisk=CD_10_demoss03 attributes name, size, offset
DATA_DEMO_CD_10_demoss03 256G 32M
DBFS_DG_CD_10_demoss03 29.125G 528.734375G
RECO_DEMO_CD_10_demoss03 272.6875G 256.046875G
ASM now shows CANDIDATE status for the newly created grid disks:
GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU MODE_ST STATE NAME FAILGROUP PATH
------------ ----------- ------- ------------ ------- -------- ------------------------ ----------- ------------------------------
0 14 CLOSED CANDIDATE ONLINE NORMAL DEMOSS03 o/192.xxx.xx.x/DATA_DEMO_CD_10_demoss03
0 27 CLOSED CANDIDATE ONLINE NORMAL DEMOSS03 o/192.xxx.xx.x/RECO_DEMO_CD_10_demoss03
0 32 CLOSED CANDIDATE ONLINE NORMAL DEMOSS03 o/192.xxx.xx.x/DBFS_DG_CD_10_demoss03
3 31 MISSING UNKNOWN OFFLINE NORMAL RECO_DEMO_CD_10_DEMOSS03 DEMOSS03
1 31 MISSING UNKNOWN OFFLINE NORMAL DATA_DEMO_CD_10_DEMOSS03 DEMOSS03
Drop the MISSING disks from ASM and add the new grid disks with rebalance power 11
alter diskgroup DATA_DEMO drop disk DATA_DEMO_CD_10_DEMOSS03 force;
alter diskgroup RECO_DEMO drop disk RECO_DEMO_CD_10_DEMOSS03 force;
alter diskgroup DATA_DEMO add disk 'o/192.xxx.xxx.xxx/DATA_DEMO_CD_10_demoss03' NAME DATA_DEMO_CD_10_DEMOSS03 rebalance power 11;
alter diskgroup RECO_DEMO add disk 'o/192.xxx.xxx.xxx/RECO_DEMO_CD_10_demoss03' NAME RECO_DEMO_CD_10_DEMOSS03 rebalance power 11;
Since there was no missing disk for DBFS_DG, we only added the disk, as below:
alter diskgroup DBFS_DG add disk 'o/192.xxx.xxx.xxx/DBFS_DG_CD_10_demoss03' NAME DBFS_DG_CD_10_DEMOSS03 rebalance power 11;
After the above, the cluster on DB node #2 was started to verify whether the DATA disk group would mount automatically. To our delight, the disk group mounted successfully without any manual intervention.
select group_number, NAME, STATE
from v$asm_diskgroup;
GROUP_NUMBER NAME STATE
------------ ----------- -----------
1 DATA_DEMO MOUNTED
2 DBFS_DG MOUNTED
3 RECO_DEMO MOUNTED
Further, we were able to start the databases whose datafiles resided on DATA diskgroup.
(Note: Doc ID – Steps to manually create cell/grid disks on Exadata if auto-create fails during disk replacement (Doc ID 1281395.1))
As mentioned in the "So, what led to this?" section, the matched bug is fixed in product version 12.1.2.2.0, while our current cell version is 11.2.3.3.1.
The permanent fix is therefore to patch and upgrade the cell storage servers, and along with that to upgrade the underlying GI to 12c (since it is on 11.2.0.3).