RMAN Backup waiting on “ASM file metadata operation”

Problem Statement

On an Exadata machine, after an archivelog backup was initiated in the database, the application started facing connectivity issues. Even connections from the local server were very slow. It was a hang-like situation that slowed down the entire machine.

Diagnosis

During the archivelog backup, the sessions associated with all RMAN channels were waiting on the “ASM file metadata operation” event. With these waits, the backup time had increased to 4-5 hours from the usual 15-20 minutes.

At the database layer we observed “ASM file metadata operation” waits but no blocking session. The issue was finally identified at the ASM instance layer, impacting all databases hosted on the Exadata machine.
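
The database-layer waits can be confirmed with a query along the following lines (an illustrative check, not the exact query used in this case):

SELECT inst_id, sid, event, state, blocking_session, program
  FROM gv$session
 WHERE event = 'ASM file metadata operation';

Since BLOCKING_SESSION was empty for these sessions at the database layer, attention shifted to the ASM instances.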

Wait events in ASM instance
   
 INST_ID    SID EVENT                P1  P2 SQL_ID        STATE        FBS PROGRAM
-------- ------ -------------------- -- --- ------------- --------- ------ ---------------------------------------
       4   2330 GCS lock cvt S        5   8 amyka66uqxwqx WAITING          oracle@<hostname> (TNS V1-V3)
                --------------------
                                   1
       3   2913 GCS lock esc          5   8               WAITING          oracle@<hostname> (TNS V1-V3)
                --------------------
                                   1
       1   1546 GCS lock open S       5   8 amyka66uqxwqx WAITING          oracle@<hostname> (TNS V1-V3)
                --------------------
                                   1
       3    876 buffer busy           5   8               WAITING      390 oracle@<hostname> (TNS V1-V3)
       3   1166                       5   8               WAITING      390 oracle@<hostname> (TNS V1-V3)
       3   1651                       5   8               WAITING      390 oracle@<hostname> (TNS V1-V3)
       3   1748                       5   8               WAITING      390 oracle@<hostname> (TNS V1-V3)
       3   4075                       5   8               WAITING      390 oracle@<hostname> (TNS V1-V3)
       3   4269                       5   8 amyka66uqxwqx WAITING      390 oracle@<hostname> (TNS V1-V3)
       3    683                       5   8               WAITING      390 oracle@<hostname> (TNS V1-V3)
       3    390                       5   8               WAITING          oracle@<hostname> (TNS V1-V3)
       1   3862                       5   8 amyka66uqxwqx WAITING          oracle@<hostname> (TNS V1-V3)
       3   1263                       5   8               WAITING      390 oracle@<hostname> (TNS V1-V3)
                --------------------
                                  10

In the ASM instance we could see blocking waits such as “GCS lock cvt S”, “GCS lock esc”, “GCS lock open S” and “buffer busy” (FBS = final blocking session). This indicates locking at the GCS layer, slowing down all ASM operations.

Too many diskgroup stat queries (SQL_ID amyka66uqxwqx) were running on the ASM instances, each attempting to read the USD block as part of gathering diskgroup space information.

Contention was observed on diskgroup #5 (P1) and USD (ASM Used Space Directory) block 8 (P2).
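
On the ASM instances, the waits can be summarised by P1/P2 to spot the hot block with something like the following (illustrative, not the query from the original diagnosis):

SELECT inst_id, event, p1 AS group_number, p2 AS block_number,
       COUNT(*) AS waiters
  FROM gv$session
 WHERE state = 'WAITING'
   AND wait_class <> 'Idle'
 GROUP BY inst_id, event, p1, p2
 ORDER BY waiters DESC;

Here P1 = 5 and P2 = 8 dominated the output, matching the contended diskgroup and USD block.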

SQL Details
 
SQL_ID        EXECUTIONS SQL_FULLTEXT
------------- ---------- ---------------------------------------------------------------------------------
amyka66uqxwqx     129661 select name_kfgrp, number_kfgrp, incarn_kfgrp, compat_kfgrp, dbcompat_kfgrp, sta
                         te_kfgrp, flags32_kfgrp, type_kfgrp, refcnt_kfgrp, sector_kfgrp, blksize_kfgrp,
                         ausize_kfgrp , totmb_kfgrp, freemb_kfgrp, coldmb_kfgrp, hotmb_kfgrp, minspc_kfgr
                         p, usable_kfgrp, offline_kfgrp, lflags_kfgrp  from x$kfgrp_stat

The above SQL shows a very high execution count (129,661) while fetching diskgroup used-space details.
The fixed table x$kfgrp_stat referenced in this statement underlies the dictionary view GV$ASM_DISKGROUP_STAT.
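
The execution count of this statement can be tracked with a query along these lines (illustrative, not from the original post):

SELECT inst_id, sql_id, executions,
       ROUND(elapsed_time / 1e6) AS elapsed_seconds
  FROM gv$sql
 WHERE sql_id = 'amyka66uqxwqx';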

The reason behind the high execution count was the presence of a DBFS file system. Here, the DBFS file system holds GoldenGate trail files, which leads to frequent write operations on it. To continuously validate the free-space status, DBFS kept re-executing this query, driving the execution count up.

To further diagnose the hang, global ASM HANGANALYZE was collected.

Global ASM HANGANALYZE
 
connect / as sysdba
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 4
--Wait 1 min to give time to identify process state changes.
oradebug -g all hanganalyze 4
oradebug tracefile_name
oradebug close_trace
exit

The following blocking chains could be seen in the HANGANALYZE report:

  
 Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'GCS lock esc'<='buffer busy'
 Chain 1 Signature Hash: 0x118af497
 [b] Chain 2 Signature: 'GCS lock cvt S'<='buffer busy'
 Chain 2 Signature Hash: 0x33376d50
 [c] Chain 3 Signature: 'GCS lock esc'<='buffer busy'
 Chain 3 Signature Hash: 0x118af497

To reduce the impact of the high execution count of this query (amyka66uqxwqx), the following workaround was attempted. It sets the DBFS df-cache property so that free-space information is cached for a period rather than re-queried on every check:

  
 BEGIN
   FOR rws IN (SELECT store_name
                 FROM TABLE(dbms_dbfs_sfs.listfilesystems)) LOOP
     -- Cache DBFS free-space ("df") information for 900 seconds per store
     dbms_dbfs_sfs.addfsproperties(
       rws.store_name,
       dbms_dbfs_content_properties_t(
         dbms_dbfs_content_property_t(dbms_dbfs_sfs.sfs_props_df_cache,
                                      900,
                                      dbms_types.typecode_number)));
     COMMIT;
   END LOOP;
 END;
 /

Note: this property can be set with PSU 11.2.0.4 + BP15 and later.

With this change, executions of this query were reduced (free-space results are now cached for 900 seconds); however, the anticipated improvement in RMAN backup performance still could not be achieved.

Since the issue was clearly connected to ASM, the next step was to review ASM best practices. This led us to check the compatibility attributes of the ASM diskgroups. Below is a snippet of the diskgroup details.

 INST_ID ASM_COMPATIBILITY RDBMS_COMPATIBILITY DISKGROUP
-------- ----------------- ------------------- ----------
       1 11.2.0.0.0        11.1.0.7.0          DATADG1
       1 11.2.0.3.0        11.2.0.3.0          DATADG
       1 11.2.0.0.0        11.1.0.7.0          DBFSDG1
       1 11.2.0.2.0        11.2.0.2.0          DBFSDG
       1 11.2.0.0.0        11.1.0.7.0          RECODG1
       1 11.2.0.3.0        11.2.0.3.0          RECODG
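
A snippet like the one above can be produced with a query along these lines (illustrative; the exact query used is not shown in the post):

SELECT inst_id,
       name                   AS diskgroup,
       compatibility          AS asm_compatibility,
       database_compatibility AS rdbms_compatibility
  FROM gv$asm_diskgroup_stat
 ORDER BY name, inst_id;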

Ideally, the ASM compatibility (compatible.asm) of a diskgroup should be set to the current Grid Infrastructure version, while the RDBMS compatibility (compatible.rdbms) can be set as low as the lowest COMPATIBLE setting of the databases that use the diskgroup.
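
The GI/ASM software version to compare against can be confirmed directly from the ASM instance, for example (illustrative check):

SELECT instance_name, version
  FROM v$instance;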

As per the diskgroup details, a few diskgroups were configured with ASM compatibility 11.2.0.3.0 and the others at 11.2.0.0.0.

Upon analyzing the above output, we noticed that the compatibility of the affected diskgroup #5 was configured lower than the GI version. Diskgroup #5 hosted the RMAN backupsets as well as the archivelogs. The compatibility mismatch led to an overall increase in reads and writes on the diskgroup.
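
The contended group number seen in P1 can be mapped to a diskgroup name with a lookup such as this (illustrative; the post does not state which diskgroup name corresponds to group #5):

SELECT group_number, name, compatibility, database_compatibility
  FROM v$asm_diskgroup
 WHERE group_number = 5;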

Solution Implemented

Diskgroup ASM compatibility was raised to 11.2.0.4, the current GI version.

 
ALTER DISKGROUP DATADG1 SET ATTRIBUTE 'compatible.asm' = '11.2.0.4.0';
ALTER DISKGROUP DBFSDG1 SET ATTRIBUTE 'compatible.asm' = '11.2.0.4.0';
ALTER DISKGROUP RECODG1 SET ATTRIBUTE 'compatible.asm' = '11.2.0.4.0';

After this change, all “ASM file metadata operation” waits disappeared.
The hang issue was resolved.
The RMAN backup once again completed within 15 minutes.

Reference Bug
Bug 21163069: EXADATA: Concurrent ‘CREATE TABLESPACE’ in ASM very slow
