Problem Statement
The database was upgraded to Oracle 19c. Shortly afterwards, on a regular working-day evening, the application team noticed a complete transaction timeout lasting for a couple of hours; the issue cleared only after EOD processing completed. The application team confirmed that these were routine jobs and that the problem began only after the 19c upgrade.
Remarkably, a similar issue recurred the following evening, so we initiated a comprehensive RCA, looking back over the historical diagnostic data.
Diagnosis
As confirmed by the application team, the database appeared to be in a complete hang during the timeout period. To identify the underlying cause of this hang, we examined the trace files generated during the problematic period.
We have retrieved the following automatic diagnostic trace file:
File Name: Jul 19 20:55 DBNAME_dia0_21759812_base_5.trc
In our analysis of this file, we scrutinised the following data.
*** 2022-07-19T20:40:11.640559+05:30
HM: Early Warning - Session ID 876 serial# 58845 OS PID 7210122 (FG) is
     waiting on 'library cache load lock' for 33 seconds, wait id 12806
     p1: 'object address'=0x7000101499aa4a8, p2: 'lock address'=0x7000101cb8659f0,
     p3: '100*mask+namespace'=0x3262f00010003
     Final Blocker is Session ID 3558 serial# 10781 on instance 1
      which is waiting on 'latch: shared pool' for 2 seconds, wait id 18217
      p1: 'address'=0x7000000007bae10, p2: 'number'=0x26b, p3: 'why'=0x0

                                                        IO
 Total  Self-         Total  Total  Outlr  Outlr  Outlr
  Hung  Rslvd  Rslvd   Wait WaitTm   Wait WaitTm   Wait
  Sess  Hangs  Hangs  Count   Secs  Count   Secs  Count Wait Event
------ ------ ------ ------ ------ ------ ------ ------ -----------
   172      0      0 1771820  96279    209  51264      0 library cache load lock

HM: Current SQL: update TABLE_NAME set COLUMN1=28373 where COLUMN2='XXXXXX' and SDATE='19-Jul-2022'

*** 2022-07-19T20:40:11.642524+05:30
HM: Early Warning - Session ID 1448 serial# 22208 OS PID 22414950 (FG) is
     waiting on 'library cache load lock' for 33 seconds, wait id 16846
     p1: 'object address'=0x7000101499aa4a8, p2: 'lock address'=0x700010227568800,
     p3: '100*mask+namespace'=0x3262f00010003
     Final Blocker is Session ID 3558 serial# 10781 on instance 1
      which is waiting on 'latch: shared pool' for 2 seconds, wait id 18217
      p1: 'address'=0x7000000007bae10, p2: 'number'=0x26b, p3: 'why'=0x0

                                                        IO
 Total  Self-         Total  Total  Outlr  Outlr  Outlr
  Hung  Rslvd  Rslvd   Wait WaitTm   Wait WaitTm   Wait
  Sess  Hangs  Hangs  Count   Secs  Count   Secs  Count Wait Event
------ ------ ------ ------ ------ ------ ------ ------ -----------
   172      0      0 1771820  96279    209  51264      0 library cache load lock

HM: Current SQL: update TABLE_NAME set COLUMN1=49823 where COLUMN2='XXXXXX' and SDATE='19-Jul-2022'
As shown above, the waiting sessions were stalled on a “library cache load lock”, while the final blocker was itself waiting on “latch: shared pool”.
Furthermore, the current SQL query exhibited a nearly identical structure, with only a minor alteration in the literal value of COLUMN1.
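To corroborate the hang chain beyond the DIA0 trace, the same wait pattern can also be cross-checked from the ASH history. The following is a minimal sketch (not part of the original investigation) that lists the waiters and blockers for the affected window, assuming Diagnostics Pack licensing:

-- Sketch: sessions waiting on the problem events during the window,
-- together with their blocking sessions and current SQL.
SELECT sample_time,
       session_id,
       blocking_session,
       sql_id,
       event
  FROM dba_hist_active_sess_history
 WHERE sample_time BETWEEN TIMESTAMP '2022-07-19 20:30:00'
                       AND TIMESTAMP '2022-07-19 21:00:00'
   AND event IN ('library cache load lock', 'latch: shared pool')
 ORDER BY sample_time, session_id;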
We then generated an AWR report to investigate performance during the time frame from July 19, 2022, 20:30:00 to July 19, 2022, 21:00:32.
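For reference, this is roughly how such a report is pulled from SQL*Plus; a minimal sketch, where the snapshot IDs would come from the first query rather than being the exact values used on our system:

-- Find the snapshots covering 19-Jul-2022 20:30 to 21:00.
SELECT snap_id, begin_interval_time, end_interval_time
  FROM dba_hist_snapshot
 WHERE begin_interval_time >= TIMESTAMP '2022-07-19 20:00:00'
   AND end_interval_time   <= TIMESTAMP '2022-07-19 21:30:00'
 ORDER BY snap_id;

-- Then run the standard AWR report script and supply those snapshot IDs:
-- @?/rdbms/admin/awrrpt.sql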
The ADDM findings were straightforward, highlighting issues with “Shared Pool Latches” and “Session Connect and Disconnect”. In addition, “Hard Parse Due to Literal Usage” was flagged as a significant concern.
As visible in the Load Profile, Hard parses per second registered at 21.8, while Logons per second reached 36.3.
These may look like small numbers, but on a highly concurrent OLTP system they can certainly have a noticeable impact.
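Outside of AWR, the same rates can be sanity-checked live. The sketch below assumes the standard “Hard Parse Count Per Sec” and “Logons Per Sec” metric names are available in V$SYSMETRIC on your version:

-- Live check of hard parse and logon rates.
SELECT metric_name,
       ROUND(value, 1) AS value_per_sec
  FROM v$sysmetric
 WHERE metric_name IN ('Hard Parse Count Per Sec', 'Logons Per Sec')
   AND intsize_csec > 5900   -- keep only the long-duration (~60s) interval rows
 ORDER BY metric_name;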
The Top Foreground Events revealed significant wait times on “latch: shared pool”, “library cache: mutex X” and “library cache load lock”, among others.
The Time Model Statistics showed notable time consumed by “connection management call elapsed time” and “parse time elapsed”.
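The same statistics can be pulled directly from the time model views. A minimal sketch using V$SYS_TIME_MODEL (these are cumulative values since instance startup, so AWR deltas remain the better view for a specific window):

-- Cumulative time model statistics since instance startup, in seconds.
SELECT stat_name,
       ROUND(value / 1e6, 1) AS seconds
  FROM v$sys_time_model
 WHERE stat_name IN ('DB time',
                     'parse time elapsed',
                     'hard parse elapsed time',
                     'connection management call elapsed time')
 ORDER BY value DESC;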
Both of these components consume memory from the shared pool, so we divided our investigation of the ‘latch: shared pool’ wait time into three distinct areas:
- latch: shared pool
- Parse time elapsed
- connection management call elapsed time
latch: shared pool
During our investigation into the root cause of the elevated shared pool latches, we uncovered interesting insights within the “Wait Event Histogram”.
We observed a few occurrences of the “SGA: allocation forcing component growth” wait event, each taking between 8 and 32 seconds to complete.
This event signals that the database is actively expanding the SGA components to meet the demands of concurrent application requests.
Our attention swiftly shifted to the sections titled “Memory Resize Operations Summary” & “Memory Resize Ops”, where we noted dynamic growth in the “shared pool”.
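The resize history is also visible directly from the dynamic performance views; a minimal sketch, assuming automatic SGA/memory management is in use so that V$SGA_RESIZE_OPS is populated:

-- Recent dynamic resize operations for the shared pool.
SELECT component,
       oper_type,
       oper_mode,
       ROUND(initial_size / 1024 / 1024) AS initial_mb,
       ROUND(final_size   / 1024 / 1024) AS final_mb,
       status,
       start_time
  FROM v$sga_resize_ops
 WHERE component = 'shared pool'
 ORDER BY start_time DESC;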
Why is the shared pool dynamically expanding, and could this growth be related to the 19c upgrade?
Parse time elapsed
The Time Model Statistics showed that roughly 50% of “parse time elapsed” was consumed by “hard parse elapsed time”.
As observed in the AWR load profile, the rate of hard parses per second fell within the 20-30 range. This frequent hard parsing activity led to the dynamic expansion of the “shared pool” in the database.
To identify the culprits behind this issue, we extracted a list of the top queries using literals with the following SQL.
col "unshared count" for a100

WITH c AS (
        SELECT force_matching_signature, Count(*) cnt
          FROM gv$sqlarea
         WHERE force_matching_signature != 0
         GROUP BY force_matching_signature
        HAVING Count(*) > 20),
     sq AS (
        SELECT sql_id,
               substr(sql_text, 1, 40) sql_text,
               force_matching_signature,
               Row_number() over (PARTITION BY force_matching_signature
                                  ORDER BY sql_id DESC) p
          FROM gv$sqlarea s
         WHERE force_matching_signature IN (SELECT force_matching_signature FROM c))
SELECT sq.sql_id,
       sq.force_matching_signature,
       sq.sql_text,
       c.cnt "unshared count"
  FROM c, sq
 WHERE sq.force_matching_signature = c.force_matching_signature
   AND sq.p = 1
   AND c.cnt > 500
 ORDER BY c.cnt DESC;
SQL_ID        FORCE_MATCHING_SIGNATURE          SQL_TEXT                                 unshared count
------------- --------------------------------- ---------------------------------------- --------------
gzz41qhxxxxxx 16950911907815000000              update TABLE_NAME set COL1=104094 where           13053
                                                SDATE='21-Jul-2022' and COL2='XXX'
gyp69g4xxxxxx 3416587319733000000               update TABLE_NAME set COLUMN1=28373               12627
                                           0026 where COLUMN2='XXXXXX' and SDATE='
                                                19-Jul-2022'
We noticed similar queries in the automatic diagnostic trace file (DBNAME_dia0_21759812_base_5.trc).
This discovery provides valuable insights into why the “shared pool” was exhibiting dynamic growth.
However, the application team has pointed out that this intensive parsing behaviour was a consistent aspect of the application even in the 12c version.
The question still lingers: Is there any connection to the 19c upgrade?
connection management call elapsed time
This was the single largest consumer in the Time Model Statistics. Although the rate of logons per second remained roughly the same as before the upgrade, the time spent here on 12c was significantly lower.
From a previous 19c upgrade we had already seen a comparable scenario involving heavy DML against the USER$ table, so we checked whether the same pattern was present in this database.
In the section titled “SQL ordered by Elapsed Time”, a notable surge in executions was apparent for SQL_ID 9zg9qd9bm4spu.
update user$ set spare6=DECODE(to_char(:2, 'YYYY-MM-DD'), '0000-00-00', to_date(NULL), :2) where user#=:1
This query was not seen in the 12c database.
The “Segments by DB Block Changes” section of the AWR report showed the highest DB block changes against the USER$ table.
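The surge in executions of this internal statement can be tracked across AWR snapshots. A minimal sketch using DBA_HIST_SQLSTAT, with the SQL_ID reported above:

-- Execution trend of the USER$ LSLT update across AWR snapshots.
SELECT s.snap_id,
       sn.end_interval_time,
       s.executions_delta,
       ROUND(s.elapsed_time_delta / 1e6, 1) AS elapsed_seconds
  FROM dba_hist_sqlstat  s
  JOIN dba_hist_snapshot sn
    ON sn.snap_id          = s.snap_id
   AND sn.dbid             = s.dbid
   AND sn.instance_number  = s.instance_number
 WHERE s.sql_id = '9zg9qd9bm4spu'
 ORDER BY s.snap_id;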
So, what led to this?
The statistics presented above led us to identify the following Oracle Bug.
Bug 33121934 – Library cache lock / load lock / mutex x during connection storm due to update user$ (Doc ID 33121934.8)
Concurrent user logons can be disrupted by the Last Successful Logon Time (LSLT) update: if concurrent logons are delayed under heavy system load or during a logon storm while waiting for the LSLT update, it is likely that this issue has resurfaced.
During a connection storm you may see wait events like the following (a quick live check is sketched after this list):
- library cache: mutex X
- library cache load lock
- library cache lock
- gc buffer busy acquire
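As a quick way to see whether such a storm is in progress, one can look at the live sessions. A minimal sketch against GV$SESSION:

-- Count sessions currently waiting on the connection-storm related events.
SELECT inst_id,
       event,
       COUNT(*) AS waiting_sessions
  FROM gv$session
 WHERE event IN ('library cache: mutex X',
                 'library cache load lock',
                 'library cache lock',
                 'gc buffer busy acquire')
   AND state = 'WAITING'      -- only sessions waiting right now
 GROUP BY inst_id, event
 ORDER BY waiting_sessions DESC;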
All three areas of the diagnosis pointed to the same resource: memory allocation from the shared pool.
Combined with the high volume of logon activity and the extensive hard parsing, this had a substantial impact.
As a consequence, contention emerged on “latch: shared pool”, “library cache: mutex X” and “library cache load lock”.
Solution Implemented
Our database version was at PSU 19.15. We applied the following patch to address the issue in the 19c release.
Bug 33121934 Library cache lock / load lock / mutex x during connection storm due to update user$
In order to minimize the frequency of these update executions, the following parameter was configured so that the LSLT update runs at most once every 30 seconds: _granularity_last_successful_login_time=30.
This parameter controls the granularity of the LSLT update for ALL users: if it is set to 30 seconds, any further logons by the same user within that 30-second window will NOT update the LSLT again.
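For reference, a hidden parameter like this should only be changed under Oracle Support's guidance. A minimal sketch of how it can be set; whether it takes effect without a restart may depend on the version, so scope=spfile plus a restart is the conservative route:

-- Set the LSLT update granularity to 30 seconds
-- (underscore parameter: change only as advised by Oracle Support).
ALTER SYSTEM SET "_granularity_last_successful_login_time" = 30
  SCOPE = SPFILE SID = '*';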
We then decided to tackle the remaining sources of shared pool latch contention.
We requested the application team to convert these hard-parsed SQL statements to use bind variables.
Recognising the challenge of modifying every query that relies on literals, we opted to take a calculated risk and configured cursor_sharing=FORCE.
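A minimal sketch of that change; cursor_sharing is a documented dynamic parameter, so it can be applied without a restart:

-- Force literal replacement so that statements differing only in
-- literal values can share a single cursor.
ALTER SYSTEM SET cursor_sharing = FORCE SCOPE = BOTH SID = '*';

This keeps the literal-heavy statements sharing cursors until the application can be converted to bind variables.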
Here are the results following all the modifications:
Hard parses per second were nearly eliminated.
Significant decrease in the elapsed time for “connection management calls” and “parsing time elapsed”.
Previously, the figures were 142,363 and 31,244, respectively.
The available free memory in the shared pool approached nearly 10GB, with no increase due to dynamic allocations.
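The free memory figure can be confirmed from V$SGASTAT; a minimal sketch:

-- Current free memory in the shared pool, in GB.
SELECT ROUND(bytes / 1024 / 1024 / 1024, 1) AS free_gb
  FROM v$sgastat
 WHERE pool = 'shared pool'
   AND name = 'free memory';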
All issues have been successfully resolved, and the application is now running flawlessly!