Problem Statement
- The OCR diskgroup contains 5 disks, although the diskgroup has external redundancy (a quick verification sketch follows this list).
- Owing to one corrupt disk in the OCR diskgroup, the diskgroup could not be mounted.
- Since the diskgroup was not mounted, CRSD crashed and the cluster could not be started manually.
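For context, the diskgroup layout can be confirmed from the ASM instance. A minimal sketch, run as the grid owner; the query is illustrative and only assumes this environment's diskgroup name:

# Hedged sketch: confirm redundancy type, state, and disk count of OCR_VOTE.
sqlplus -s / as sysasm <<'EOF'
SELECT g.name, g.type, g.state, COUNT(d.disk_number) AS disk_count
  FROM v$asm_diskgroup g
  JOIN v$asm_disk d ON d.group_number = g.group_number
 WHERE g.name = 'OCR_VOTE'
 GROUP BY g.name, g.type, g.state;
EOF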
Diagnosis
Excerpts from ASM alert log
WARNING: cache read a corrupt block: group=3(OCR_VOTE) fn=2 blk=0 disk=4 (OCR_VOTE_0004) incarn=3852524076 au=10 blk=0 count=1
ORA-15196: invalid ASM block header [kfc.c:29757] [hard_kfbh] [2] [0] [0 != 130]
WARNING: cache read (retry) a corrupt block: group=3(OCR_VOTE) fn=2 blk=0 disk=4 (OCR_VOTE_0004) incarn=3852524076 au=10 blk=0 count=1
ORA-15196: invalid ASM block header [kfc.c:29757] [hard_kfbh] [2] [0] [0 != 130]
ORA-15196: invalid ASM block header [kfc.c:29757] [hard_kfbh] [2] [0] [0 != 130]
WARNING: Failed to verify disk 4 (OCR_VOTE_0004) of group 3 (OCR_VOTE) path /dev/ASMDISKS/OCR_DBNAME_ST1_01 reason: hard_kfbh 0 != 130
Looking at the above output, one of the ASM disks (disk=4, OCR_VOTE_0004) from the OCR_VOTE diskgroup was corrupted and kept failing with these errors.
The OCR_VOTE diskgroup then dismounted automatically with the errors below.
ORA-15335: ASM metadata corruption detected in disk group 'OCR_VOTE'
ORA-15130: diskgroup "OCR_VOTE" is being dismounted
ORA-15066: offlining disk "OCR_VOTE_0004" in group "OCR_VOTE" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:29757] [hard_kfbh] [2] [0] [0 != 130]
ORA-15196: invalid ASM block header [kfc.c:29757] [hard_kfbh] [2] [0] [0 != 130]
SUCCESS: diskgroup OCR_VOTE was dismounted
SUCCESS: alter diskgroup OCR_VOTE dismount force /* ASM SERVER:286271857 */
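All of the above came from the ASM alert log; for reference, it can also be tailed through ADRCI (a sketch; the ADR home path varies per host and instance name):

# Hedged sketch: find the +ASM ADR home and tail its alert log.
adrci exec="show homes"
adrci exec="set home diag/asm/+asm/+ASM1; show alert -tail 100"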
Excerpts from CRS alert log
Inaccessibility of the OCR caused the CRSD process to crash, as seen in the CRS alert log.
2019-12-05 01:44:46.785 [CRSD(18789)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage layer error [Insufficient quorum to open OCR devices] [0]]. Details at (:CRSD00111:) in /GRIDHOME/app/grid/orabase/diag/crs/crs-hostname1/crs/trace/crsd.trc.
2019-12-05 01:44:46.954 [CRSD(18862)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 18862
2019-12-05 01:44:48.102 [CRSD(18862)]CRS-1013: The OCR location in an ASM disk group is inaccessible. Details in /GRIDHOME/app/grid/orabase/diag/crs/crs-hostname1/crs/trace/crsd.trc.
We tried to mount the diskgroup manually, but it failed with the errors below.
SQL> alter diskgroup OCR_VOTE mount force
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "4" is missing from group number "1"

2019-12-05T15:07:18.439094+05:30
ERROR: alter diskgroup OCR_VOTE mount force
So, what led to this?
We came to know that one of the five OCR disks (/dev/ASMDISKS/OCR_DBNAME_ST1_01) had been accidentally overwritten using the dd command.
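For illustration only, the destructive command would have looked something like the following. This is a hypothetical reconstruction; we did not capture the exact invocation:

# DESTRUCTIVE -- do not run. Hypothetical reconstruction of the accident:
# zeroing an ASM disk destroys the header and every allocation unit on it.
dd if=/dev/zero of=/dev/ASMDISKS/OCR_DBNAME_ST1_01 bs=1M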
The OCR_VOTE diskgroup had been created with EXTERNAL REDUNDANCY, i.e. without any mirroring.
The absence of failure groups prevented the diskgroup from mounting with a missing disk.
Although, as per the standards, a backup of the OCR is taken every 4 hours, that backup is preserved inside an ASM diskgroup, which in our case was dismounted; therefore the backup could not be restored directly.
ocrconfig -showbackup
PROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy
crs-hostname1 2019/12/04 22:20:31 +OCR_VOTE:/DBNAME/OCRBACKUP/backup00.ocr.286.1026166829 0
crs-hostname1 2019/12/04 18:20:28 +OCR_VOTE:/DBNAME/OCRBACKUP/backup01.ocr.288.1026152425 0
crs-hostname1 2019/12/04 14:20:24 +OCR_VOTE:/DBNAME/OCRBACKUP/backup02.ocr.289.1026138023 0
crs-hostname1 2019/12/03 02:19:53 +OCR_VOTE:/DBNAME/OCRBACKUP/day.ocr.282.1026008395 0
crs-hostname1 2019/11/22 14:16:14 +OCR_VOTE:/DBNAME/OCRBACKUP/week.ocr.284.1025014575 0
PROT-25: Manual backups for the Oracle Cluster Registry are not available
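This is exactly why it is worth pointing the backup location outside ASM. A preventive sketch (the directory is a placeholder, not part of the original setup):

# Hedged sketch: keep OCR backups on a filesystem so they survive a diskgroup loss.
ocrconfig -backuploc /GRIDHOME/app/grid/ocr_backups    # placeholder directory
ocrconfig -manualbackup                                # take an on-demand backup
ocrconfig -showbackup manual                           # verify it is listed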
What Next?
Next, we checked the header of the corrupt disk using kfed and tried to reconstruct the corrupt header from that of one of the surviving OCR disks using kfed merge.
These efforts were in vain since the entire disk had been wiped out, not just the header; the corruption had been introduced by the dd command.
The following Doc ID was referred to for the recovery steps:
How to Restore/Repair/Fix An Overwritten (KFBTYP_INVALID) ASM Disk Header (First 4K) 10.2.0.5, 11.1.0.7, 11.2 And Onward (Doc ID 1088867.1)
KFED steps followed
kfed read /dev/ASMDISKS/OCR_DBNAME_ST1_02 | egrep 'ausize|dsknum|dskname|grpname|fgname'    ===> Good disk
kfdhdb.dsknum:                        3 ; 0x024: 0x0003
kfdhdb.dskname:           OCR_VOTE_0003 ; 0x028: length=13
kfdhdb.grpname:                OCR_VOTE ; 0x048: length=8
kfdhdb.fgname:            OCR_VOTE_0003 ; 0x068: length=13
kfdhdb.ausize:                  4194304 ; 0x0bc: 0x00400000

kfed repair /dev/ASMDISKS/OCR_DBNAME_ST1_01 ausz=4194304
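For comparison, this is roughly what reading the wiped disk looked like (an illustrative sketch of the expected shape; the exact dump was not preserved):

# Hedged sketch: a disk overwritten by dd typically dumps with an invalid header.
kfed read /dev/ASMDISKS/OCR_DBNAME_ST1_01 | head -3
# Illustrative output on a zeroed header:
#   kfbh.endian: 0 ; 0x000: 0x00
#   kfbh.hard:   0 ; 0x001: 0x00
#   kfbh.type:   0 ; 0x002: KFBTYP_INVALID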
We then tried to patch the disk header using the kfed merge option.
kfed read /dev/ASMDISKS/OCR_DBNAME_ST1_02 > /tmp/OCR_DBNAME_ST1_02.log
cp -p /tmp/OCR_DBNAME_ST1_02.log /tmp/OCR_DBNAME_ST1_01.log
We modified a few entries in the text file /tmp/OCR_DBNAME_ST1_01.log to match the problematic disk, disk 4 (OCR_VOTE_0004):
kfdhdb.dsknum: 4 ; 0x024: 0x0004
kfdhdb.dskname: OCR_VOTE_0004 ; 0x028: length=13
kfdhdb.fgname: OCR_VOTE_0004 ; 0x068: length=13
kfbh.block.obj: 2147483652 ; 0x008: disk=4
(Note: the decimal value 2147483652 is 0x80000004 in hex. The last digit signifies the disk number, i.e. 0x8000000"4"; after changing that digit, the hex value can be converted back to decimal.)
kfed merge /dev/ASMDISKS/OCR_DBNAME_ST1_01 text=/tmp/OCR_DBNAME_ST1_01.log
The merge completed successfully, but the diskgroup still would not mount.
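The patched header itself read back cleanly; a verification sketch:

# Hedged sketch: after kfed merge, the header should read as a valid disk header.
kfed read /dev/ASMDISKS/OCR_DBNAME_ST1_01 | egrep 'kfbh.type|dskname|grpname'
# Expected (illustrative): kfbh.type = KFBTYP_DISKHEAD, dskname = OCR_VOTE_0004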
We liaised with the Oracle Global Support team, who initially suggested a complete reconfiguration of the cluster.
But we still wanted to give it a try so that at least the OCR backup could be restored.
Solution Implemented
With the help of the Oracle Global Support team, an alternative using the AMDU utility was suggested.
We then extracted a backup of the OCR using the AMDU utility and, using this backup, restored a copy of the OCR into a new HIGH redundancy diskgroup.
How was the AMDU utility used to extract the OCR backup?
cd /tmp
amdu -diskstring '/dev/ASMDISKS/OCR_DBNAME_ST1*' -extract ocr_vote.286
cd amdu_2019_12_05_16_59_23
ls -lrt
-rw-r--r-- 1 grid oinstall 2424832 Dec 5 16:59 OCR_VOTE_286.f
-rw-r--r-- 1 grid oinstall    9455 Dec 5 16:59 report.txt
Using the above command, the OCR backup was extracted from the available disks with the diskgroup still dismounted.
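Before going further, the extracted image can be sanity-checked with ocrdump (a sketch, assuming this release's -backupfile option):

# Hedged sketch: dump the extracted OCR backup to text to confirm it is readable.
ocrdump -stdout -backupfile /tmp/amdu_2019_12_05_16_59_23/OCR_VOTE_286.f | head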
Referring to the AMDU help for the extract option, we could see the following:
amdu -help extract

Files to extract
-extract <diskgroup>.<file_number>:
This extracts the numbered file from the named diskgroup, case insensitive. This option may be specified multiple times to extract multiple files. The extracted file is placed in the dump directory under the name <diskgroup>_<number>.f where <diskgroup> is the diskgroup name in uppercase, and <number> is the file number. The -output option may be used to write the file to any location. The extracted file will appear to have the same contents it would have if accessed through the database. If some portion of the file is unavailable then that portion of the output file will be filled with 0xBADFDA7A, and a message will appear on stderr.
The above content specifies the format <diskgroup>.<file_number>.
In our case it was ocr_vote.286.
What is this file_number 286? How did the support team decide on this number for extracting the OCR backup?
Here we again referred to the OCR backup listing.
crs-hostname1 2019/12/04 22:20:31 +OCR_VOTE:/DBNAME/OCRBACKUP/backup00.ocr.286.1026166829 0
crs-hostname1 2019/12/04 18:20:28 +OCR_VOTE:/DBNAME/OCRBACKUP/backup01.ocr.288.1026152425 0
crs-hostname1 2019/12/04 14:20:24 +OCR_VOTE:/DBNAME/OCRBACKUP/backup02.ocr.289.1026138023 0
crs-hostname1 2019/12/03 02:19:53 +OCR_VOTE:/DBNAME/OCRBACKUP/day.ocr.282.1026008395 0
crs-hostname1 2019/11/22 14:16:14 +OCR_VOTE:/DBNAME/OCRBACKUP/week.ocr.284.1025014575 0
The number 286 was picked from the latest OCR backup, backup00.ocr.286.1026166829: in an ASM file name, the second-to-last dot-separated token is the ASM file number and the last is the incarnation.
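A throwaway sketch of reading the file number off the backup name:

# Hedged sketch: the second-to-last dot-separated token is the ASM file number.
echo "backup00.ocr.286.1026166829" | awk -F. '{print $(NF-1)}'    # prints 286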
How to Restore ASM Based OCR when OCR backup is located in ASM diskgroup? (Doc ID 2569847.1)
This amdu extract command creates an OCR backup image file in the current directory. The OCR can then be restored using the procedure given in the Doc ID below:
How to Restore ASM Based OCR After Complete Loss of the CRS Diskgroup on Linux/Unix Systems (Doc ID 1062983.1)
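A condensed sketch of that procedure as we applied it. The commands follow the note's flow, but the new disk paths and diskgroup name are placeholders and exact flags vary by GI version, so follow the note itself:

# Hedged sketch of the restore flow per Doc ID 1062983.1.
crsctl stop crs -f                     # stop the stack on all nodes
crsctl start crs -excl -nocrs          # node 1 only: exclusive mode, no CRSD
sqlplus -s / as sysasm <<'EOF'
-- Placeholder disk paths; voting files in a HIGH redundancy diskgroup
-- need five failure groups, hence five disks here.
CREATE DISKGROUP OCR_VOTE_NEW HIGH REDUNDANCY
  DISK '/dev/ASMDISKS/OCR_NEW_01', '/dev/ASMDISKS/OCR_NEW_02',
       '/dev/ASMDISKS/OCR_NEW_03', '/dev/ASMDISKS/OCR_NEW_04',
       '/dev/ASMDISKS/OCR_NEW_05'
  ATTRIBUTE 'compatible.asm' = '12.2';
EOF
ocrconfig -restore /tmp/amdu_2019_12_05_16_59_23/OCR_VOTE_286.f
crsctl replace votedisk +OCR_VOTE_NEW
crsctl stop crs -f
crsctl start crs                       # then restart normally on all nodes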
The spfile and voting disks were successfully moved; however, as there were no guidelines for moving the ASM password file, that step was missed.
The ASM password file was then recreated using the Doc ID below:
How to recreate shared ASM password file in 12c GI cluster (Doc ID 1929673.1)
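A minimal sketch of that recreation via asmcmd (diskgroup name and password are placeholders):

# Hedged sketch: recreate the shared ASM password file in the new diskgroup.
asmcmd pwcreate --asm +OCR_VOTE_NEW/orapwASM 'Placeholder_Pwd1'
asmcmd pwget --asm                     # confirm the registered location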