Problem Statement
On an EXADATA machine, after an archivelog backup was initiated in the database, the application started facing connectivity issues. Even connections from the local server were very slow. It was a hang-like situation that slowed down the entire machine.
Diagnosis
During the archivelog backup, the sessions associated with all RMAN channels were waiting on the “ASM file metadata operation” event. With these waits, the backup time had increased to 4-5 hours from the usual 15-20 minutes.
At the database layer we observed “ASM file metadata operation” waits but no blocking session. The issue was finally identified at the ASM instance layer, impacting all databases hosted on the EXADATA machine.
Wait events in ASM instance
INST_ID SID EVENT P1 P2 SQL_ID STATE FBS PROGRAM
-------- ------ -------------------- -- --- ------------- --------- ------ ---------------------------------------
4 2330 GCS lock cvt S 5 8 amyka66uqxwqx WAITING [email protected] (TNS V1-V3)
--------------------
1
3 2913 GCS lock esc 5 8 WAITING [email protected] (TNS V1-V3)
--------------------
1
1 1546 GCS lock open S 5 8 amyka66uqxwqx WAITING [email protected] (TNS V1-V3)
--------------------
1
3 876 buffer busy 5 8 WAITING 390 [email protected] (TNS V1-V3)
3 1166 5 8 WAITING 390 [email protected] (TNS V1-V3)
3 1651 5 8 WAITING 390 [email protected] (TNS V1-V3)
3 1748 5 8 WAITING 390 [email protected] (TNS V1-V3)
3 4075 5 8 WAITING 390 [email protected] (TNS V1-V3)
3 4269 5 8 amyka66uqxwqx WAITING 390 [email protected] (TNS V1-V3)
3 683 5 8 WAITING 390 [email protected] (TNS V1-V3)
3 390 5 8 WAITING [email protected] (TNS V1-V3)
1 3862 5 8 amyka66uqxwqx WAITING [email protected] (TNS V1-V3)
3 1263 5 8 WAITING 390 [email protected] (TNS V1-V3)
--------------------
10
In the ASM instance we could see blocking events such as “GCS lock cvt S”, “GCS lock esc”, “GCS lock open S” and “buffer busy”. This indicates locking at the GCS (Global Cache Service) layer, slowing down all ASM operations.
There were too many disk group stat queries (SQL_ID amyka66uqxwqx) running on the ASM instances, each attempting to fetch the USD block as part of its information gathering.
Contention was observed on disk group #5 (p1) and USD (ASM Used Space Directory) block 8 (p2).
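A listing like the one above can be produced with a query of this form; this is a minimal sketch against GV$SESSION (the FBS column here maps to FINAL_BLOCKING_SESSION), not necessarily the exact script used.

-- Sketch: active, non-idle sessions in the ASM instance with their wait details.
-- Run while connected to the ASM instance.
SELECT inst_id, sid, event, p1, p2, sql_id, state,
       final_blocking_session AS fbs, program
  FROM gv$session
 WHERE status = 'ACTIVE'
   AND wait_class <> 'Idle'
 ORDER BY event, inst_id, sid;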
SQL Details
SQL_ID         EXECUTIONS  SQL_FULLTEXT
-------------  ----------  ---------------------------------------------------------------
amyka66uqxwqx      129661  select name_kfgrp, number_kfgrp, incarn_kfgrp, compat_kfgrp,
                           dbcompat_kfgrp, state_kfgrp, flags32_kfgrp, type_kfgrp,
                           refcnt_kfgrp, sector_kfgrp, blksize_kfgrp, ausize_kfgrp,
                           totmb_kfgrp, freemb_kfgrp, coldmb_kfgrp, hotmb_kfgrp,
                           minspc_kfgrp, usable_kfgrp, offline_kfgrp, lflags_kfgrp
                           from x$kfgrp_stat
The above SQL shows a steep increase in the number of executions while fetching disk group used-space details.
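The execution count for this SQL_ID can be checked from the ASM instance shared pool with a query along these lines (a sketch, assuming GV$SQL is queried on the ASM instance):

-- Sketch: execution count of the disk group stat query across ASM instances.
SELECT inst_id, sql_id, executions, sql_text
  FROM gv$sql
 WHERE sql_id = 'amyka66uqxwqx'
 ORDER BY inst_id;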
The table x$kfgrp_stat referenced in this statement underlies the dictionary view GV$ASM_DISKGROUP_STAT.
The reason behind the high execution count was the presence of a DBFS file system. This DBFS file system holds GoldenGate trail files, which leads to frequent write operations on it. To continuously validate the free-space status, the query was being executed again and again.
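The DBFS SecureFiles stores present in the database can be listed with the same dbms_dbfs_sfs.listfilesystems call that is used in the workaround further below (a sketch, run as a suitably privileged user in the database hosting DBFS):

-- Sketch: list DBFS SecureFiles store names.
SELECT store_name FROM TABLE(dbms_dbfs_sfs.listfilesystems);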
To further diagnose the hang, a global ASM HANGANALYZE trace was collected.
Global ASM HANGANALYZE
connect / as sysdba
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 4
-- Wait 1 min to give time to identify process state changes.
oradebug -g all hanganalyze 4
oradebug tracefile_name
oradebug close_trace
exit
The following blocking chains could be seen in the HANGANALYZE report:
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'GCS lock esc'<='buffer busy'
     Chain 1 Signature Hash: 0x118af497
 [b] Chain 2 Signature: 'GCS lock cvt S'<='buffer busy'
     Chain 2 Signature Hash: 0x33376d50
 [c] Chain 3 Signature: 'GCS lock esc'<='buffer busy'
     Chain 3 Signature Hash: 0x118af497
To reduce the impact of the high execution count of this query (amyka66uqxwqx), the following workaround was attempted:
BEGIN
  FOR rws IN (SELECT store_name
                FROM TABLE(dbms_dbfs_sfs.listfilesystems)) LOOP
    dbms_dbfs_sfs.addfsproperties(
      rws.store_name,
      dbms_dbfs_content_properties_t(
        dbms_dbfs_content_property_t(
          dbms_dbfs_sfs.sfs_props_df_cache,
          900,
          dbms_types.typecode_number)));
    COMMIT;
  END LOOP;
END;
/
Note: This property can be set on PSU 11.2.0.4 + BP15 and later.
With this change the free-space result is cached, so the query is re-executed only about once every 900 seconds and its execution count dropped; however, the anticipated improvement in RMAN backup performance was still not achieved.
Since the issue was clearly connected to ASM, the next step was to review ASM best practices. This led us to check the compatibility settings of the ASM disk groups. Below is a snippet of the disk group details.
| Inst ID | ASM Compatibility | RDBMS Compatibility | Diskgroup |
|---------|-------------------|---------------------|-----------|
| 1       | 11.2.0.0.0        | 11.1.0.7.0          | DATADG1   |
| 1       | 11.2.0.3.0        | 11.2.0.3.0          | DATADG    |
| 1       | 11.2.0.0.0        | 11.1.0.7.0          | DBFSDG1   |
| 1       | 11.2.0.2.0        | 11.2.0.2.0          | DBFSDG    |
| 1       | 11.2.0.0.0        | 11.1.0.7.0          | RECODG1   |
| 1       | 11.2.0.3.0        | 11.2.0.3.0          | RECODG    |
Ideally, ASM compatibility (compatible.asm) should be set to the current Grid Infrastructure version, while RDBMS compatibility (compatible.rdbms) can be set as low as the lowest COMPATIBLE setting of the databases using the disk group.
As per the disk group details, a few disk groups were configured with ASM compatibility 11.2.0.3.0 while the others were still at 11.2.0.0.0.
Analyzing the above output, we noticed that the compatibility of the affected disk group #5 was configured lower than the GI version. Disk group #5 hosted the RMAN backupsets as well as the archivelogs. The compatibility mismatch led to an overall increase in reads and writes on the disk group.
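The compatibility settings of all disk groups can be reviewed with a query of this form (a minimal sketch using the standard V$ASM_DISKGROUP view, run from the ASM instance):

-- Sketch: ASM and RDBMS compatibility of each disk group.
SELECT group_number, name, compatibility, database_compatibility
  FROM v$asm_diskgroup
 ORDER BY name;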
Solution Implemented
The disk group ASM compatibility was raised to 11.2.0.4.0, the current GI version.
ALTER DISKGROUP DATADG1 SET ATTRIBUTE 'compatible.asm' = '11.2.0.4.0';
ALTER DISKGROUP DBFSDG1 SET ATTRIBUTE 'compatible.asm' = '11.2.0.4.0';
ALTER DISKGROUP RECODG1 SET ATTRIBUTE 'compatible.asm' = '11.2.0.4.0';
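The new attribute values can be confirmed afterwards with a query such as the following (a sketch against V$ASM_ATTRIBUTE, run from the ASM instance):

-- Sketch: confirm compatible.asm for each disk group after the change.
SELECT g.name AS diskgroup, a.name AS attribute, a.value
  FROM v$asm_attribute a
  JOIN v$asm_diskgroup g ON g.group_number = a.group_number
 WHERE a.name = 'compatible.asm'
 ORDER BY g.name;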
After this change, all “ASM file metadata operation” wait events disappeared.
The hang issue was resolved.
The RMAN backup completed within 15 minutes.
Reference Bug
Bug 21163069: EXADATA: Concurrent ‘CREATE TABLESPACE’ in ASM very slow