“Exadata Patching Failed”: ASM instance not starting after rescue

Problem Statement
  1. Exadata patching failed on the 1st compute node's OS; subsequently the node was unable to boot, and the ILOM console hung at the GRUB> prompt.
  2. After rescuing the compute node, the ASM instance failed to start, preventing the database instances from starting on the affected node.
  3. The Exadata was left running with mixed patch versions [compute nodes and IB switches on the old version, whereas the cell nodes and DB/GRID homes were on the newer version].
  4. Databases were running on a single node, exposing the system to the likelihood of complete downtime if the surviving node crashed for any reason.
Diagnosis
Pre-checks were done for the below Exadata components:
  • GRID & RDBMS Homes
  • Compute Nodes
  • Cell Nodes
  • IB Switches

All pre-check logs were error- and conflict-free. EXACHK output was collected and verified.
(Note: To date, no Exadata Bundle Patch had been applied to this Exadata box since it was commissioned.)
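
For context, the pre-checks boiled down to commands along these lines (a sketch only; group files and patch locations are environment-specific, and the exact flags should be verified against the README shipped with each patch):

./exachk -a                                                          # full ExaCHK health check before patching
./patchmgr -cells cell_group -patch_check_prereq -rolling            # cell node prerequisite check (rolling)
./patchmgr -ibswitches ibswitch_group -upgrade -ibswitch_precheck    # IB switch prerequisite check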

MOS Doc ID for Exadata Patching
Exadata Database Machine and Exadata Storage Server Supported Versions (Doc ID 888828.1)

With this, we were all set to go ahead with the patching activity and proceeded with it.

Patching started

Activity completed before the issue surfaced:

  • GRID & RDBMS Homes Patching completed
  • CELLNODES Patching Completed

We progressed to compute node patching in a rolling manner using the conventional dbnodeupdate.sh utility, roughly as sketched below.
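
The rough per-node flow looked like this (a sketch; the zip name is a placeholder and the exact options come from the dbnodeupdate.sh README for the target release):

./dbnodeupdate.sh -u -l /u01/patches/p<nnnnnnnn>_Linux-x86-64.zip -v   # dry-run prerequisite check
./dbnodeupdate.sh -u -l /u01/patches/p<nnnnnnnn>_Linux-x86-64.zip      # run the upgrade; the node reboots (several times for OL5 -> OL6)
./dbnodeupdate.sh -c                                                   # once the node is back, complete the post-upgrade steps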

Current Image:
11.2.3.3.1.140529.1

Target Image:
12.2.1.1.8

The patch upgrades the compute node OS from Oracle Linux 5 to Oracle Linux 6, during which the compute node undergoes a series of reboots.

We proceeded with dbnodeupdate.sh and opened the compute node ILOM console to track the progress.
In the ILOM console, we could see numerous changes happening in the background.
The compute node went for its first reboot. While the reboot was in progress, we waited for the node to come up completely and resume the OS patching.
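
For reference, the console was followed like this (a sketch; the ILOM hostname is a placeholder):

ssh root@demodbadm01-ilom      # log in to the compute node ILOM
-> start /SP/console           # attach to the host serial console to watch the patching and reboots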

Issues observed during Compute Node patching

However, in the ILOM console we could see the node stuck at the GRUB> prompt.

We thought it would take some time to reboot and hence waited 15-20 minutes. Even after 45 minutes the compute node had not come up, nor was there any change in the ILOM console. It was just stuck at GRUB>.

The following actions were performed (roughly as sketched below):

  • A series of manual reboots of compute node.
  • Reset ILOM and power off compute node.
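
A sketch of those attempts (the ILOM hostname and credentials are placeholders):

ipmitool -I lanplus -H demodbadm01-ilom -U root chassis power cycle   # hard power cycle of the host
ipmitool -I lanplus -H demodbadm01-ilom -U root chassis power off     # full power off ...
ipmitool -I lanplus -H demodbadm01-ilom -U root chassis power on      # ... and power back on
ssh root@demodbadm01-ilom
-> reset /SP                                                          # reset the service processor itself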

None of the above worked.

After some deliberation within the team, we decided to perform a node rescue using the diagnostics ISO.
(Note: This is a risky operation and must be done carefully.)

MOS Doc ID
How to recover from a failed Linux Exadata DB Server dbnodeupdate or rollback (Doc ID 1952372.1)

We were able to salvage the compute node and restore it to its pre-patching state.
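
At a high level, the rescue flow looked like this (a sketch only; the authoritative step-by-step procedure is in Doc ID 1952372.1):

# 1. Attach the diagnostics ISO matching the old image (11.2.3.3.1) to the node as a
#    virtual CD-ROM through the ILOM Remote Console.
# 2. Boot the node from the virtual CD-ROM:
ssh root@demodbadm01-ilom
-> set /HOST boot_device=cdrom
-> reset /SYS
# 3. In the diagnostics shell, choose the rescue/rollback option so the node is restored from
#    the spare system image (the /dev/mapper/VGExaDb-LVDbSys2 backup taken by dbnodeupdate.sh),
#    then detach the ISO and reboot from disk.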

What Next????

There was some relief when we rescued the 1st compute node; however, it was short-lived.
We faced another issue.
During cluster startup, the CSSD process was terminated and the ASM instance was unable to start.
Errors extracted from Cluster alert log:

  
[ohasd(40083)]CRS-2112:The OLR service started on node demodbadm01.
2018-12-09 10:04:38.258: 
[ohasd(40083)]CRS-1301:Oracle High Availability Service started on node demodbadm01.
2018-12-09 10:04:38.258: 
[ohasd(40083)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2018-12-09 10:04:41.468: 
[/u01/app/11.2.0.4/grid/bin/orarootagent.bin(40125)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 
2018-12-09 10:04:45.658: 
[gpnpd(40270)]CRS-2328:GPNPD started on node demodbadm01. 
2018-12-09 10:04:47.979: 
[cssd(40366)]CRS-1713:CSSD daemon is started in clustered mode
2018-12-09 10:08:36.390: 
[cssd(40366)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0.4/grid/log/demodbadm01/cssd/ocssd.log
2018-12-09 10:12:40.538: 
[cssd(40366)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0.4/grid/log/demodbadm01/cssd/ocssd.log
2018-12-09 10:14:47.052: 
[/u01/app/11.2.0.4/grid/bin/cssdagent(40347)]CRS-5818:Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/11.2.0.4/grid/log/demodbadm01/agent/ohasd/oracssdagent_root//oracssdagent_root.log.  
2018-12-09 10:14:47.053: 
[cssd(40366)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0.4/grid/log/demodbadm01/cssd/ocssd.log
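
The CRS-1714 messages pointed at voting file discovery, so the first-line checks were along these lines (a sketch, run from the 11.2.0.4 grid home as root or the grid owner):

/u01/app/11.2.0.4/grid/bin/crsctl stat res -t -init       # state of ora.cssd / ora.asm on this node
/u01/app/11.2.0.4/grid/bin/crsctl check css               # is CSS up at all?
/u01/app/11.2.0.4/grid/bin/crsctl query css votedisk      # which voting files CSS can currently see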

We suspected some discrepancy at the ASM/cluster/cell layer, i.e. a failure in the communication between the cluster and the cell nodes.

A few basic checks were performed

  • ASM alert log
  • CRS Logs
  • Cell Node alert logs
  • OS logs
  • Diskmon logs
  • Oracle binary permission check and comparison with surviving node

Advanced checks

  • Clearing of stale socket files from /var/tmp, i.e. /var/tmp/.oracle
  • Configuration comparison between the affected node and the surviving node (w.r.t. ASM disks)
  • strace of kfod -> strace -aef -o /tmp/strace_kfodcmd_service_NW_restart.txt /u01/app/11.2.0.4/grid/bin/kfod op=disks disks=all asm_diskstring=o/*/*
  • RDS validation (communication between the compute node and the cell grid disks) -> lsmod | grep -i rds (see the sketch after this list)
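
The supporting commands, consolidated (a sketch; cell_group and the grid home path are environment-specific):

ls -l /var/tmp/.oracle                                    # stale clusterware socket files (cleared only with the stack down)
lsmod | egrep -i 'rds|ib'                                 # RDS / InfiniBand kernel modules loaded?
dcli -g cell_group -l root 'cellcli -e "list griddisk attributes name,status,asmmodestatus"'   # grid disk state on each cell
/u01/app/11.2.0.4/grid/bin/kfod op=disks disks=all asm_diskstring='o/*/*'                      # grid disks visible to ASM from this node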

All of the logs showed the impact of the failure, but none offered any clue to the root cause.
The issue could have been at any layer.
Time was ticking, and the customer was starting to lose patience.

Now what to do????

It was an undesirable situation.
Patching had failed, ASM was not starting, and the Exadata was running with mixed patch versions [DB nodes and IB switches on the old version, whereas the cell nodes and GRID/DB homes were on the newer version], which could lead to mysterious issues.

Our OS team proposed validating the system firmware versions, since we had rolled back the compute node OS patching.
The command used was CheckHWnFWProfile, present in the /opt/oracle.SupportTools directory.

So let’s see what unfolded.

  
/opt/oracle.SupportTools/CheckHWnFWProfile -d
Server_Model=SUN_SERVER_X4-2
<BIOS_Info>
<BIOS>
<BIOSVendor VALUE="American Megatrends Inc."/>
<BIOSVersion VALUE="25010603"/>
</BIOS>
</BIOS_Info>
<Processor_Info>
<SocketCount VALUE="2"/>
<Processor SIGNATURE="Intel(R) Xeon(R) E5-2697 v2 2.70GHz" SLOT="P0">
<ProcessorModel VALUE="Intel(R) Xeon(R) E5-2697 v2 2.70GHz"/>
</Processor>
<Processor SIGNATURE="Intel(R) Xeon(R) E5-2697 v2 2.70GHz" SLOT="P1">
<ProcessorModel VALUE="Intel(R) Xeon(R) E5-2697 v2 2.70GHz"/>
</Processor>
</Processor_Info>
<ILOM_Info>
<ILOM>
<ILOMVersion VALUE="3.1.2.30.b r86439"/>
</ILOM>
</ILOM_Info>
<PhysicalMemory_Info>
<PhysicalMemoryGB VALUE="256"/>
</PhysicalMemory_Info>
<DiskController_Info>
<DiskController>
<DiskControllerModel VALUE="LSI MegaRAID SAS 9261-8i"/>
<DiskControllerPCIeSlotWidth VALUE="x8"/>
<DiskControllerSilicon VALUE="B4"/>
<DiskControllerBatteryBackupUnitVersion VALUE="iBBU08"/>
<DiskControllerFirmwareRevision VALUE="12.12.0-0243"/>
<DiskControllerPCIeSlotNumber VALUE="4"/>
</DiskController>
</DiskController_Info>
<InfinibandHCA_Info>
<InfinibandHCA SIGNATURE="X4-2_X4-2L_X4-8_ORC1090120019" SLOT="3">
<InfinibandHCAImageType VALUE="ConnectX"/>
<InfinibandHCAPCIeSlotSpeed VALUE="8GTs"/>
<InfinibandHCAPCIeSlotWidth VALUE="x8"/>
<InfinibandHCAChipRevision VALUE="0x1"/>
<InfinibandHCADeviceID VALUE="4099"/>
<InfinibandHCAFirmwareVersion FIRMWARE_ID="3" VALUE="2.35.5532"/> <<<<!!!!  
<InfinibandHCAPCIeSlotNumber VALUE="3"/>
<InfinibandHCAHardwareRev VALUE="XX"/>
<InfinibandHCAPSID VALUE="ORC1090120019"/>
</InfinibandHCA>
</InfinibandHCA_Info>

This was checked against the same command's output from the surviving node.
The difference was as below:


Affected Node:
<InfinibandHCAFirmwareVersion FIRMWARE_ID="3" VALUE="2.35.5532"/> 

Surviving Node:
<InfinibandHCAFirmwareVersion FIRMWARE_ID="3" VALUE="2.11.1280"/>
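
(For reference, this kind of drift can be spotted quickly by diffing the profile output of the two nodes; a sketch, assuming demodbadm02 is the surviving node and root ssh equivalence is in place:)

/opt/oracle.SupportTools/CheckHWnFWProfile -d > /tmp/hwfw_affected.txt
ssh demodbadm02 '/opt/oracle.SupportTools/CheckHWnFWProfile -d' > /tmp/hwfw_surviving.txt
diff /tmp/hwfw_affected.txt /tmp/hwfw_surviving.txt | grep -i Firmware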

How come there was a mismatch? We rescued the affected node, right? Ideally, all components should have been restored to their pre-upgrade state.

So, what led to this????

When we patch the OS, all components get upgraded, including the firmware.
But when OS patching fails and we rescue the node, the OS components are rolled back while the firmware is not.

This is a known issue, and we were able to confirm it using the below Doc ID:

Exadata 18.1.4.0.0 release and patch (27347059) (Doc ID 2348949.1)
Issue 2.10 – Infiniband card FW 2.35.5532 is incompatible on Exadata Storage Servers 11.2.3.3.1 and earlier releases

To avoid the InfiniBand card FW 2.35.5532 compatibility issue with Exadata Storage Server 11.2.3.3.1 and earlier releases, the system should roll back the InfiniBand card firmware to 2.11.1280. Symptoms of this issue include validation hangs and InfiniBand network issues.

Solution Implemented

We had to manually downgrade the InfiniBand card firmware on the compute node.
This was done using the command below:

  
/opt/oracle.cellos/CheckHWnFWProfile -action updatefw -component InfinibandHCA -mode exact
Server_Model=SUN_SERVER_X4-2
Check for firmware to be updated...
Initialize firmware update, please wait...

[INFO] Extract firmware package
[INFO] Extract firmware package finished.
Check md5sum for firmware files....
md5sum check for InfinibandHCA, InfinibandHCAFirmwareVersion with signature=X4-2_X4-2L_X4-8_ORC1090120019, succeeded.
md5sum check for InfinibandHCA, InfinibandHCAFirmwareVersion with signature=X4-2_X4-2L_X4-8_MT_1090120019, succeeded.
All firmware md5sums match.
Now update:InfinibandHCA,InfinibandHCAFirmwareVersion,firmware_id=3
[INFO] Now updating the InfiniBand Host Control Adapter (HCA) firmware ...
[INFO] Now updating the InfiniBand Host Control Adapter (HCA) firmware ...
[INFO] This will shut down the InfiniBand interfaces.
[INFO] Executing: /usr/bin/mstflint -y -d /proc/bus/pci/40/00.0 -i /tmp/firmware/SUNInfinibandHCAFirmware_ORC1090120019 burn  

Current FW version on flash: 2.35.5532
New FW version: 2.11.1280

Note: The new FW version is not newer than the current FW version on flash.

Do you want to continue ? (y/n) [n] : y

Burning FW image without signatures - OK
Restoring signature - OK
All firmware update suceeded
[INFO] Post firmware update process
[INFO] Rebooting in 5 minutes for firmware updates to take effect ...

Cache Flush is successfully done on adapter 0.

Exit Code: 0x00
Starting ipmi drivers: [ OK ]
Starting ipmi_watchdog driver: [ OK ]
Starting ipmi_poweroff driver: [ OK ]

Cache Flush is successfully done on adapter 0.

Exit Code: 0x00
[INFO] /usr/bin/ipmitool power cycle
Chassis Power Control: Cycle

After this, the ASM instance started and we were able to bring up the corresponding database instances.
Wow, this came out of nowhere. But the good thing is, we recovered from a precarious situation.
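
For completeness, the post-fix verification was along these lines (a sketch; the database and instance names are placeholders):

/opt/oracle.SupportTools/CheckHWnFWProfile -d | grep InfinibandHCAFirmwareVersion   # should now report 2.11.1280
/u01/app/11.2.0.4/grid/bin/crsctl start crs        # as root, if the stack did not auto-start after the reboot
/u01/app/11.2.0.4/grid/bin/crsctl stat res -t      # ora.asm and the database resources should come ONLINE
srvctl start instance -d <dbname> -i <instance>    # bring up each database instance on the recovered node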

But!!!! What is the RCA for the Compute Node Patching Failure?

Even though we had started the database instances successfully, the compute nodes were still on the older version and still had to be patched.
Now came the time to arrive at the RCA and continue with the patching.

With such issues, the customer was not in the mood to go ahead with compute node patching without a concrete RCA. We proceeded with verification of the dbnodeupdate logs.

Log Verification

The dbnodeupdate.log did not seem to have complete information.
The log ends right after the filesystem backup.

  
[1544291046][2018-12-08 12:47:24 -0500][INFO][./dbnodeupdate.sh][PrintMsg][] (ACTION:) Performing filesystem backup to /dev/mapper/VGExaDb-LVDbSys2. Avg. 30 minutes (maximum 120) depends per environment
[1544291046][2018-12-08 12:47:24 -0500][INFO][./dbnodeupdate.sh][ExecUpgrade][] Performing filesystem backup to /dev/mapper/VGExaDb-LVDbSys2. Avg. 30 minutes (maximum 120) depends per environment
[1544291046][2018-12-08 12:47:24 -0500][INFO][./dbnodeupdate.sh][DiaryEntry][] Entering ExecDbServerBackupScript /dev/mapper/VGExaDb-LVDbSys2 No
[1544291046][2018-12-08 12:47:24 -0500][INFO][./dbnodeupdate.sh][ExecDbServerBackupScript][] /opt/oracle.SupportTools/dbnodeupdate_helpers/11.2.3.3.1/dbserver_backup.sh
[1544291046][2018-12-08 12:47:24 -0500][INFO][./dbnodeupdate.sh][DiaryEntry][] Entering SearchAndWaitForString No other partitions were backed up. You may manually prepare back up for other partitions. /var/log/exadatatmp/081218124406/backup.out.081218124406 7200 already exists on the spare root partition Backup failed Size of spare root partition less then size  

The reason the patch failed could have been a corrupted GRUB.
This was reinforced by the fact that the ILOM showed the compute node stuck at the GRUB prompt.

At this point we took the help of the Oracle Global Support team to figure out the exact reason.

Messages from ILOM host console logs:

  
Active dbupdate is set to yes
[INFO ] An OS upgrade post steps are indicated by: ol5-ol6
[INFO ] Redirecting call to the host-based [/opt/oracle.cellos/image_functions] function [image_functions_dbnodeupdate_post] 
[INFO ] File [/opt/oracle.cellos/image_functions] does not have function [image_functions_dbnodeupdate_post]; will proceed with the code here.
2018-12-08 13:09:16 -0500 [INFO] set_kernel: Set default 4.1.12-94.8.4.el6uek
2018-12-08 13:09:22 -0500 [INFO] Set "root=LABEL=DBSYS bootarea=dbsys bootfrom=BOOT ro loglevel=7 panic=60 debug pci=noaer log_buf_len=1m nmi_watchdog=0 transparent_hugepage=never rd_NO_PLYMOUTH audit=1 conso 
le=tty1 console=ttyS0,115200n8 crashkernel=448M@128M numa=off" in the kernel line with 4.1.12-94.8.4.el6uek.x86_64 in /boot/grub/grub.conf
cat: /boot/grub/grub.stage.version: No such file or directory
[INFO ] Grub version in /boot/grub/grub.stage.version: undefined
[INFO ] Grub rpm version: 0.97-94.0.3.el6_7.1
[INFO ] Copying /usr/share/grub/x86_64-redhat/* to /boot/grub
[INFO ] Force setup grub

Further analysis found the following:

  • The message “cat: /boot/grub/grub.stage.version: No such file or directory” was reported during the “Post upgrade configuration steps” when running dbnodeupdate.sh, after which it hung:
  
2967 [INFO ] %%% [Database Node OS Upgrade: Phase Four] Post upgrade configuration steps, cleanup %%%
2968 [INFO ] [Phase 4] Post package upgrade configurations.
2969 [INFO ] +-------------------------------------------------+
2970 [INFO ] |CONFIGURATION STEPS WILL NOW BE RUN.
2971 [INFO ] |THESE CAN TAKE 10 MINUTES. WAIT FOR FINAL REBOOT.
2972 [INFO ] +-------------------------------------------------+
.
.
3001 cat: /boot/grub/grub.stage.version: No such file or directory
3002 [INFO ] Grub version in /boot/grub/grub.stage.version: undefined
3003 [INFO ] Grub rpm version: 0.97-94.0.3.el6_7.1
3004 [INFO ] Copying /usr/share/grub/x86_64-redhat/* to /boot/grub
3005 [INFO ] Force setup grub

  • At the OS level, /boot/grub/grub.stage.version is not a native file in standard Linux OL6; it is specific to Exadata engineered systems. The fact that /boot/grub/grub.stage.version appears to be missing may cause the boot to hang.
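
Simple checks that illustrate the point (a sketch, to be compared against the surviving node):

ls -l /boot/grub/grub.stage.version     # missing on the affected node, as the log above shows
rpm -q grub                             # grub rpm version (0.97-94.0.3.el6_7.1 per the log)
rpm -qf /boot/grub/grub.stage.version   # on a healthy node: shows whether any rpm owns the file or whether the Exadata image scripts lay it down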

In addition, dbnodeupdate.sh.trc shows the following:

  
[1544292352][2018-12-08 13:09:22 -0500][TRACE][/opt/oracle.SupportTools/dbupgrade/image_functions - 7156][calculate_sysdisk_device][7156] Arguments: 0 lsi COMPUTE 0 
[1544292352][2018-12-08 13:09:22 -0500][WARNING][/opt/oracle.SupportTools/dbupgrade/image_functions - 7285][calculate_sysdisk_device][7285] [CMD: -get vd_by_name DBSYS] 
[CMD_STATUS: 127] 
----- START STDERR ----- 
/opt/oracle.SupportTools/dbupgrade/imageLogger: line 637: -get: command not found

This was an Exadata X4-2 upgrade from 11.2.3.3.1.140529.1 to 12.2.1.1.8.180818 using dbnodeupdate.sh.
The dbupgrade step (run by dbnodeupdate.sh) failed because it was unable to retrieve the target ID: LSI_INTERFACE was not defined in /opt/oracle.cellos/exadata.img.env, which is why the disk controller call in the trace above was reduced to “-get vd_by_name DBSYS” and failed with “command not found”.

This is due to: BUG 28542279 – DBNODEUPDATE.SH ERROR MATCH_RE: COMMAND NOT FOUND

Solution

Before the upgrade, copy the exadata.img.env file from dbserver.patch.zip to the /opt/oracle.cellos directory on each compute node, or use the latest dbserver.patch.zip.
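
A rough sketch of that workaround (the zip file name and its internal layout are assumptions; follow the README inside dbserver.patch.zip):

unzip -o p21634633_*_Linux-x86-64.zip -d /u01/patches/dbserver_patch   # latest dbserver.patch.zip (Patch 21634633)
find /u01/patches/dbserver_patch -name exadata.img.env                 # locate the shipped file
cp <path_found_above>/exadata.img.env /opt/oracle.cellos/              # repeat on each compute node
grep LSI_INTERFACE /opt/oracle.cellos/exadata.img.env                  # confirm the variable is now defined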

Activity Outcome

We downloaded the latest dbserver.patch.zip:
Patch 21634633: DBSERVER.PATCH.ZIP ORCHESTRATOR PLUS DBNU – ARU PLACEHOLDER

We copied the exadata.img.env file to the /opt/oracle.cellos directory on all compute nodes.

With this, all compute nodes were patched successfully, along with the IB switch upgrade.
This concluded the Exadata patching activity.
