How to restore an OCR diskgroup after 1 of its disks was formatted and the backups were not accessible

Problem Statement
  1. The OCR diskgroup contains 5 disks, although it was created with external redundancy.
  2. Owing to 1 corrupt disk in the OCR diskgroup, the diskgroup could not be mounted.
  3. Since the diskgroup was not mounted, CRSD crashed and the cluster could not be started manually.
Diagnosis
Excerpts from ASM alert log
 
 WARNING: cache read a corrupt block: group=3(OCR_VOTE) fn=2 blk=0 disk=4 (OCR_VOTE_0004) incarn=3852524076 au=10 blk=0 count=1
 ORA-15196: invalid ASM block header [kfc.c:29757] [hard_kfbh] [2] [0] [0 != 130]
 WARNING: cache read (retry) a corrupt block: group=3(OCR_VOTE) fn=2 blk=0 disk=4 (OCR_VOTE_0004) incarn=3852524076 au=10 blk=0 count=1
 ORA-15196: invalid ASM block header [kfc.c:29757] [hard_kfbh] [2] [0] [0 != 130]
 ORA-15196: invalid ASM block header [kfc.c:29757] [hard_kfbh] [2] [0] [0 != 130]
 WARNING: Failed to verify disk 4 (OCR_VOTE_0004) of group 3 (OCR_VOTE) path /dev/ASMDISKS/OCR_DBNAME_ST1_01 reason: hard_kfbh 0 != 130 

Looking at the above output, one of the ASM disks (disk=4, OCR_VOTE_0004) in the OCR_VOTE diskgroup was corrupted and was failing with the ORA-15196 errors shown.
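
As an illustration of how the failing disk can be picked out of a long alert log, here is a minimal Python sketch. The regular expression is modelled only on the single warning line in the excerpt above, and the function name find_failed_disks is hypothetical; other ASM versions may format this message differently.

```python
import re

# Pattern modelled on the "Failed to verify disk" warning in the excerpt above.
FAILED_VERIFY = re.compile(
    r"WARNING: Failed to verify disk (?P<disk>\d+) \((?P<name>\S+)\) "
    r"of group (?P<group>\d+) \((?P<group_name>\S+)\) "
    r"path (?P<path>\S+) reason: (?P<reason>.+)"
)


def find_failed_disks(log_text):
    """Return one dict per 'Failed to verify disk' warning found in log_text."""
    results = []
    for line in log_text.splitlines():
        match = FAILED_VERIFY.search(line)
        if match:
            info = match.groupdict()
            info["disk"] = int(info["disk"])
            info["group"] = int(info["group"])
            info["reason"] = info["reason"].strip()
            results.append(info)
    return results


# The warning line from the alert log excerpt above.
sample = (
    " WARNING: Failed to verify disk 4 (OCR_VOTE_0004) of group 3 (OCR_VOTE) "
    "path /dev/ASMDISKS/OCR_DBNAME_ST1_01 reason: hard_kfbh 0 != 130 "
)
failed = find_failed_disks(sample)
print(failed[0]["name"], failed[0]["path"])
```

Having the device path next to the ASM disk name makes it easier to confirm with the storage team which physical device was affected.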
The OCR_VOTE diskgroup was then dismounted automatically with the errors below.

 
 ORA-15335: ASM metadata corruption detected in disk group 'OCR_VOTE'
 ORA-15130: diskgroup "OCR_VOTE" is being dismounted
 ORA-15066: offlining disk "OCR_VOTE_0004" in group "OCR_VOTE" may result in a data loss 
 ORA-15196: invalid ASM block header [kfc.c:29757] [hard_kfbh] [2] [0] [0 != 130]
 ORA-15196: invalid ASM block header [kfc.c:29757] [hard_kfbh] [2] [0] [0 != 130]

 SUCCESS: diskgroup OCR_VOTE was dismounted
 SUCCESS: alter diskgroup OCR_VOTE dismount force /* ASM SERVER:286271857 */

Excerpts from CRS alert log

Because the OCR was inaccessible, the CRSD process crashed. The details were visible in the CRS alert log.

 
 2019-12-05 01:44:46.785 [CRSD(18789)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage layer error [Insufficient quorum to open OCR devices] [0]]. Details at (:CRSD00111:) in /GRIDHOME/app/grid/orabase/diag/crs/crs-hostname1/crs/trace/crsd.trc. 
 2019-12-05 01:44:46.954 [CRSD(18862)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 18862
 2019-12-05 01:44:48.102 [CRSD(18862)]CRS-1013: The OCR location in an ASM disk group is inaccessible. Details in /GRIDHOME/app/grid/orabase/diag/crs/crs-hostname1/crs/trace/crsd.trc.

We tried to mount the diskgroup manually, but it failed with the errors below.

 
 SQL> alter diskgroup OCR_VOTE mount force
 ORA-15032: not all alterations performed
 ORA-15040: diskgroup is incomplete
 ORA-15042: ASM disk "4" is missing from group number "1" 

 2019-12-05T15:07:18.439094+05:30
 ERROR: alter diskgroup OCR_VOTE mount force

So, what led to this????

We came to know that 1 of the 5 OCR disks (/dev/ASMDISKS/OCR_DBNAME_ST1_01) had been accidentally formatted using the dd command.
The OCR_VOTE diskgroup was created with EXTERNAL REDUNDANCY, without any mirroring.
With no mirrored copies in other failure groups, the diskgroup could not be mounted with a disk missing.

By default, the OCR is backed up automatically every 4 hours, but these backups were stored in the same ASM diskgroup, which in our case was dismounted, so the backup could not be restored.

 
 ocrconfig -showbackup

 PROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy
 crs-hostname1 2019/12/04 22:20:31 +OCR_VOTE:/DBNAME/OCRBACKUP/backup00.ocr.286.1026166829 0
 crs-hostname1 2019/12/04 18:20:28 +OCR_VOTE:/DBNAME/OCRBACKUP/backup01.ocr.288.1026152425 0
 crs-hostname1 2019/12/04 14:20:24 +OCR_VOTE:/DBNAME/OCRBACKUP/backup02.ocr.289.1026138023 0 
 crs-hostname1 2019/12/03 02:19:53 +OCR_VOTE:/DBNAME/OCRBACKUP/day.ocr.282.1026008395 0
 crs-hostname1 2019/11/22 14:16:14 +OCR_VOTE:/DBNAME/OCRBACKUP/week.ocr.284.1025014575 0
 PROT-25: Manual backups for the Oracle Cluster Registry are not available

What Next????

Next, we checked the header of the corrupt disk using kfed and tried to rebuild it from the header of one of the healthy OCR disks using kfed merge.
These efforts were in vain because the entire disk had been wiped, not just the header.

This corruption was introduced by the dd command.

We referred to the following Doc ID for the recovery steps:
How to Restore/Repair/Fix An Overwritten (KFBTYP_INVALID) ASM Disk Header (First 4K) 10.2.0.5, 11.1.0.7, 11.2 And Onward (Doc ID 1088867.1)

KFED steps followed
 
 kfed read /dev/ASMDISKS/OCR_DBNAME_ST1_02 | egrep 'ausize|dsknum|dskname|grpname|fgname' ===> Good disk 
 kfdhdb.dsknum:              3 ; 0x024: 0x0003
 kfdhdb.dskname: OCR_VOTE_0003 ; 0x028: length=13
 kfdhdb.grpname:      OCR_VOTE ; 0x048: length=8
 kfdhdb.fgname:  OCR_VOTE_0003 ; 0x068: length=13
 kfdhdb.ausize:        4194304 ; 0x0bc: 0x00400000

 kfed repair /dev/ASMDISKS/OCR_DBNAME_ST1_01 ausz=4194304
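
The ausz value passed to kfed repair above was read off the healthy disk by eye. A minimal Python sketch of pulling it out of the kfed read output programmatically is shown here; the parser (parse_kfed) is hypothetical and assumes only the "field: value ; offset: detail" line layout visible in the excerpt above.

```python
import re

# Matches kfed read lines such as
# "kfdhdb.ausize:        4194304 ; 0x0bc: 0x00400000".
KFED_LINE = re.compile(r"^\s*(?P<field>[\w.\[\]]+):\s*(?P<value>.*?)\s*;")


def parse_kfed(text):
    """Map each kfed field name to its value (the text before the ';')."""
    fields = {}
    for line in text.splitlines():
        match = KFED_LINE.match(line)
        if match:
            fields[match.group("field")] = match.group("value")
    return fields


# Output of kfed read on the good disk, copied from the excerpt above.
good_disk = """\
 kfdhdb.dsknum:              3 ; 0x024: 0x0003
 kfdhdb.dskname: OCR_VOTE_0003 ; 0x028: length=13
 kfdhdb.grpname:      OCR_VOTE ; 0x048: length=8
 kfdhdb.fgname:  OCR_VOTE_0003 ; 0x068: length=13
 kfdhdb.ausize:        4194304 ; 0x0bc: 0x00400000
"""
header = parse_kfed(good_disk)
ausz = int(header["kfdhdb.ausize"])
print(f"ausz={ausz}")
```

The allocation unit size here is 4194304 bytes (4 MiB), which is the value used for the ausz parameter in the kfed repair command.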

We then tried to patch the disk header using the kfed merge option.

 
 kfed read /dev/ASMDISKS/OCR_DBNAME_ST1_02 > /tmp/OCR_DBNAME_ST1_02.log
 cp -p /tmp/OCR_DBNAME_ST1_02.log /tmp/OCR_DBNAME_ST1_01.log

 Modified a few entries in the text file /tmp/OCR_DBNAME_ST1_01.log
 Problematic disk - disk 4 (OCR_VOTE_0004)

 kfdhdb.dsknum:              4 ; 0x024: 0x0004
 kfdhdb.dskname: OCR_VOTE_0004 ; 0x028: length=13
 kfdhdb.fgname:  OCR_VOTE_0004 ; 0x068: length=13

 kfbh.block.obj:    2147483652 ; 0x008: disk=4
 (Note: decimal 2147483652 is hex 0x80000004. The last digit signifies the disk number, i.e. 0x8000000"4". For another disk, change that digit and convert the value from hex back to decimal.)

 kfed merge /dev/ASMDISKS/OCR_DBNAME_ST1_01 text=/tmp/OCR_DBNAME_ST1_01.log
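
The kfbh.block.obj arithmetic from the note can be sketched in Python as follows. This assumes, as the value shown above suggests, that the field is the constant 0x80000000 with the disk number in the low bits; the function names are hypothetical.

```python
# High bit seen in the kfbh.block.obj value shown above (assumption: it is
# constant for disk header blocks and the disk number sits in the low bits).
DISK_HEADER_FLAG = 0x80000000


def block_obj_for_disk(disk_number):
    """Return the decimal kfbh.block.obj value for a given disk number."""
    return DISK_HEADER_FLAG | disk_number


def disk_number_from_block_obj(block_obj):
    """Recover the disk number by masking off the high bit."""
    return block_obj & 0x7FFFFFFF


value = block_obj_for_disk(4)
print(value, hex(value), disk_number_from_block_obj(value))
```

For disk 4 this gives decimal 2147483652, hex 0x80000004, matching the entry edited into the text file.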

The merge completed successfully, but the diskgroup still would not mount.

We engaged the Oracle Global Support team, who initially suggested a complete reconfiguration of the cluster.
We still wanted to try, so that at least an OCR backup could be restored.

Solution Implemented

With the help of the Oracle Global Support team, an alternative using the AMDU utility was suggested.
We extracted an OCR backup using AMDU and, from this backup, restored a copy of the OCR into a new HIGH redundancy diskgroup.

How was the AMDU utility used to extract the OCR backup?
 
 cd /tmp
 amdu -diskstring '/dev/ASMDISKS/OCR_DBNAME_ST1*' -extract ocr_vote.286 
 cd amdu_2019_12_05_16_59_23 
 ls -lrt 
 -rw-r--r-- 1 grid oinstall 2424832 Dec 5 16:59 OCR_VOTE_286.f 
 -rw-r--r-- 1 grid oinstall 9455 Dec 5 16:59 report.txt

The above command extracted the OCR backup from the available disks offline, with the diskgroup dismounted.
The AMDU help for the extract option explains the syntax:

 
 amdu -help
 extract Files to extract
 -extract <diskgroup>.<file_number>: 
       This extracts the numbered file from the named diskgroup, case insensitive. 
       This option may be specified multiple times to extract multiple files. 
       The extracted file is placed in the dump directory under the name <diskgroup>_<number>.f 
       where <diskgroup> is the diskgroup name in uppercase, and <number> is the file number. 
       The -output option may be used to write the file to any location. 
       The extracted file will appear to have the same contents it would have if accessed through the database.  
       If some portion of the file is unavailable then that portion of the output file 
       will be filled with 0xBADFDA7A, and a message will appear on stderr.
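
Since one disk of the diskgroup was wiped, it is worth checking whether the extracted file contains the filler value mentioned in the help text. The following Python sketch counts 4-byte aligned occurrences of 0xBADFDA7A; the byte order in which AMDU writes the filler is an assumption here, so both orders are checked, and the function name is hypothetical.

```python
FILLER_WORD = 0xBADFDA7A  # filler value named in the amdu help text


def count_filler_words(data, word=FILLER_WORD):
    """Count 4-byte aligned occurrences of the filler word in either byte order."""
    patterns = {word.to_bytes(4, "big"), word.to_bytes(4, "little")}
    count = 0
    for offset in range(0, len(data) - 3, 4):
        if bytes(data[offset:offset + 4]) in patterns:
            count += 1
    return count


# Synthetic buffers for illustration; a real check would read the extracted
# file instead, e.g. open("OCR_VOTE_286.f", "rb").read().
clean = bytes(16)
damaged = bytes(8) + FILLER_WORD.to_bytes(4, "little") * 2
print(count_filler_words(clean), count_filler_words(damaged))
```

A non-zero count would suggest that part of the file lived on the unavailable disk and that an older backup file number should be tried instead.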

The help text above gives the format <diskgroup>.<file_number>.
In our case it was ocr_vote.286.

What is this file number 286? How did the support team decide on this number to extract the OCR backup file?

Here we referred to the list of OCR backup files again.

 
 crs-hostname1 2019/12/04 22:20:31 +OCR_VOTE:/DBNAME/OCRBACKUP/backup00.ocr.286.1026166829 0
 crs-hostname1 2019/12/04 18:20:28 +OCR_VOTE:/DBNAME/OCRBACKUP/backup01.ocr.288.1026152425 0
 crs-hostname1 2019/12/04 14:20:24 +OCR_VOTE:/DBNAME/OCRBACKUP/backup02.ocr.289.1026138023 0 
 crs-hostname1 2019/12/03 02:19:53 +OCR_VOTE:/DBNAME/OCRBACKUP/day.ocr.282.1026008395 0
 crs-hostname1 2019/11/22 14:16:14 +OCR_VOTE:/DBNAME/OCRBACKUP/week.ocr.284.1025014575 0

The number 286 was taken from the name of the latest OCR backup, backup00.ocr.286.1026166829 (the first entry in the list).
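
The same lookup can be sketched in Python: parse the ocrconfig -showbackup lines, pick the most recent entry, and build the <diskgroup>.<file_number> argument for amdu -extract. The parsing is based only on the line layout shown above, and the helper names are hypothetical.

```python
import re

# Layout of a backup line: node, date, time, +DISKGROUP:/path/name.<file>.<incarnation>
BACKUP_LINE = re.compile(
    r"^\s*(?P<node>\S+)\s+(?P<date>\d{4}/\d{2}/\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+\+(?P<diskgroup>[^:]+):(?P<path>\S+)"
)


def parse_backups(text):
    """Return a list of dicts for every backup line; other lines are skipped."""
    backups = []
    for line in text.splitlines():
        match = BACKUP_LINE.match(line)
        if not match:
            continue
        # The file name ends in .<file_number>.<incarnation>.
        parts = match.group("path").rsplit(".", 2)
        backups.append({
            "timestamp": match.group("date") + " " + match.group("time"),
            "diskgroup": match.group("diskgroup"),
            "file_number": int(parts[1]),
        })
    return backups


def latest_extract_arg(text):
    """Build the amdu -extract argument for the most recent backup."""
    latest = max(parse_backups(text), key=lambda b: b["timestamp"])
    return f"{latest['diskgroup'].lower()}.{latest['file_number']}"


showbackup = """\
 PROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy
 crs-hostname1 2019/12/04 22:20:31 +OCR_VOTE:/DBNAME/OCRBACKUP/backup00.ocr.286.1026166829 0
 crs-hostname1 2019/12/04 18:20:28 +OCR_VOTE:/DBNAME/OCRBACKUP/backup01.ocr.288.1026152425 0
 crs-hostname1 2019/12/03 02:19:53 +OCR_VOTE:/DBNAME/OCRBACKUP/day.ocr.282.1026008395 0
"""
print(latest_extract_arg(showbackup))
```

Comparing the timestamps as strings works here because the date and time fields are zero-padded and ordered from year down to second.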

Reference: How to Restore ASM Based OCR when OCR backup is located in ASM diskgroup? (Doc ID 2569847.1)

The amdu extract command creates an OCR backup image file in a dump directory (amdu_<timestamp>) under the current directory. The OCR can then be restored using the procedure given in the Doc ID below:
How to Restore ASM Based OCR After Complete Loss of the CRS Diskgroup on Linux/Unix Systems (Doc ID 1062983.1)

The spfile and voting disks were moved successfully. However, since that procedure has no guidelines for moving the ASM password file, the ASM password file was missed.

The ASM password file was recreated using the Doc ID below:
How to recreate shared ASM password file in 12c GI cluster (Doc ID 1929673.1)
