“latch: shared pool”, “library cache: mutex X” & “library cache load lock”

Problem Statement

The database was upgraded to Oracle 19c. Then, on a regular working-day evening, the application team noticed a complete transaction timeout lasting a couple of hours; the issue resolved itself once EOD processing completed. The application team confirmed that these were routine jobs and that the problem began only after the 19c upgrade.

Remarkably, a similar issue recurred the following evening. As a result, we initiated a comprehensive RCA to investigate the history of this problem.

Diagnosis

As confirmed by the application team, it was a complete HANG during the timeout period. To identify the underlying cause of this HANG, we examined the trace file generated during the problematic period.

We have retrieved the following automatic diagnostic trace file:
File Name: Jul 19 20:55 DBNAME_dia0_21759812_base_5.trc

In our analysis of this file, we scrutinised the following data.

 
 *** 2022-07-19T20:40:11.640559+05:30
 HM: Early Warning - Session ID 876 serial# 58845 OS PID 7210122 (FG)
     is waiting on 'library cache load lock' for 33 seconds, wait id 12806
     p1: 'object address'=0x7000101499aa4a8, p2: 'lock address'=0x7000101cb8659f0, p3: '100*mask+namespace'=0x3262f00010003 
    Final Blocker is Session ID 3558 serial# 10781 on instance 1
     which is waiting on 'latch: shared pool' for 2 seconds, wait id 18217
     p1: 'address'=0x7000000007bae10, p2: 'number'=0x26b, p3: 'why'=0x0

                                                     IO
  Total  Self-         Total  Total  Outlr  Outlr  Outlr
   Hung  Rslvd  Rslvd   Wait WaitTm   Wait WaitTm   Wait
   Sess  Hangs  Hangs  Count   Secs  Count   Secs  Count Wait Event
 ------ ------ ------ ------ ------ ------ ------ ------ -----------
    172      0      0 1771820 96279    209  51264      0 library cache load lock

 HM: Current SQL: update TABLE_NAME set COLUMN1=28373 where COLUMN2='XXXXXX' and SDATE='19-Jul-2022'


 *** 2022-07-19T20:40:11.642524+05:30
 HM: Early Warning - Session ID 1448 serial# 22208 OS PID 22414950 (FG)
     is waiting on 'library cache load lock' for 33 seconds, wait id 16846
     p1: 'object address'=0x7000101499aa4a8, p2: 'lock address'=0x700010227568800, p3: '100*mask+namespace'=0x3262f00010003
    Final Blocker is Session ID 3558 serial# 10781 on instance 1
     which is waiting on 'latch: shared pool' for 2 seconds, wait id 18217
     p1: 'address'=0x7000000007bae10, p2: 'number'=0x26b, p3: 'why'=0x0

                                                     IO
  Total  Self-         Total  Total  Outlr  Outlr  Outlr
   Hung  Rslvd  Rslvd   Wait WaitTm   Wait WaitTm   Wait
   Sess  Hangs  Hangs  Count   Secs  Count   Secs  Count Wait Event
 ------ ------ ------ ------ ------ ------ ------ ------ -----------
    172      0      0 1771820 96279    209  51264      0 library cache load lock

 HM: Current SQL: update TABLE_NAME set COLUMN1=49823 where COLUMN2='XXXXXX' and SDATE='19-Jul-2022'

As noted above, the waiting sessions were stalled on a “library cache load lock” while the final blocker was itself waiting on a “latch: shared pool”.
Furthermore, the current SQL in both sessions was nearly identical in structure, differing only in the literal value assigned to COLUMN1.

We decided to generate an AWR report to investigate performance during the window from July 19, 2022, 20:30:00 to 21:00:32.
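For reference, this is roughly how an AWR report for such a window can be pulled from SQL*Plus; the DBID and snapshot IDs below are placeholders and must be replaced with the values that bracket the 20:30-21:00 window in DBA_HIST_SNAPSHOT (the interactive @?/rdbms/admin/awrrpt.sql script achieves the same result).

 -- Find the snapshots that bracket the problem window
 select snap_id, begin_interval_time, end_interval_time
 from   dba_hist_snapshot
 where  end_interval_time   >= timestamp '2022-07-19 20:30:00'
 and    begin_interval_time <= timestamp '2022-07-19 21:01:00'
 order  by snap_id;

 -- Generate the HTML report between the two snapshots found above
 -- (the DBID 1234567890 and snap ids 1010/1012 are illustrative only)
 select output
 from   table(dbms_workload_repository.awr_report_html(1234567890, 1, 1010, 1012));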

The ADDM findings were straightforward, highlighting “Shared Pool Latches” & “Session Connect and Disconnect”. They also flagged a high occurrence of “Hard Parse Due to Literal Usage”.

As visible in the Load Profile, Hard parses per second registered at 21.8, while Logons per second reached 36.3.

Even though these are small numbers in isolation, they can certainly have a noticeable impact on a highly concurrent OLTP system.
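Outside of AWR, these rates can also be sampled live; a minimal sketch against V$SYSMETRIC follows, assuming the standard metric names “Hard Parse Count Per Sec” and “Logons Per Sec”.

 -- Latest 60-second rates for hard parses and logons
 select metric_name, round(value, 1) as per_sec
 from   v$sysmetric
 where  metric_name in ('Hard Parse Count Per Sec', 'Logons Per Sec')
 and    intsize_csec > 5900;   -- keep only the 60-second interval rows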
The Top Foreground Events revealed significant wait times on “latch: shared pool”, “library cache: mutex X” and “library cache load lock”, among others.

Time model statistics showed notable consumption of “connection management call elapsed time” and “parse time elapsed”.

Both of these components consume memory from the ‘Shared Pool’. We therefore split our investigation of the ‘latch: shared pool’ wait time into three distinct threads:

  • latch: shared pool
  • Parse time elapsed
  • connection management call elapsed time

latch: shared pool

During our investigation into the root cause of the elevated shared pool latches, we uncovered interesting insights within the “Wait Event Histogram”.

We observed a few instances of the “SGA: allocation forcing component growth” operation, each taking between 8 and 32 seconds to complete.
This event signals that the database is actively expanding the SGA components to meet the demands of concurrent application requests.

Our attention swiftly shifted to the sections titled “Memory Resize Operations Summary” & “Memory Resize Ops”, where we noted dynamic growth in the “shared pool”.
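The same growth can be confirmed directly from the instance; a minimal sketch against V$SGA_RESIZE_OPS follows (under Automatic Memory Management the equivalent view is V$MEMORY_RESIZE_OPS), with sizes converted to MB for readability.

 -- Recent automatic resize operations affecting the shared pool
 select start_time, oper_type, oper_mode,
        round(initial_size/1024/1024) as initial_mb,
        round(final_size/1024/1024)   as final_mb,
        status
 from   v$sga_resize_ops
 where  component = 'shared pool'
 order  by start_time;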

Why is the shared pool dynamically expanding, and could this growth be related to the 19c upgrade?

Parse time elapsed

Time Model Statistics highlighted that 50% of “parse time elapsed” was consumed by “hard parse elapsed time”.
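The same breakdown can be checked live from V$SYS_TIME_MODEL; the sketch below reports cumulative values since instance startup, converted from microseconds to seconds.

 -- Cumulative time-model breakdown of parsing and connection management
 select stat_name, round(value/1e6, 1) as seconds
 from   v$sys_time_model
 where  stat_name in ('parse time elapsed',
                      'hard parse elapsed time',
                      'connection management call elapsed time')
 order  by value desc;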

As observed in the AWR load profile, the rate of hard parses per second fell within the 20-30 range. This frequent hard parsing activity led to the dynamic expansion of the “shared pool” in the database.

To identify the culprits, we extracted a list of the top queries using literals with the following SQL.

 
 col "unshared count" for a100
 WITH c
       AS (SELECT force_matching_signature,
                     Count(*) cnt
            FROM   gv$sqlarea
            WHERE  force_matching_signature != 0
            GROUP  BY force_matching_signature
            HAVING Count(*) > 20),
       sq
       AS (SELECT sql_id,
                     substr(sql_text,1,40) sql_text,
                     force_matching_signature,
                     Row_number()
                       over (
                          PARTITION BY force_matching_signature
                          ORDER BY sql_id DESC) p
            FROM   gv$sqlarea s
            WHERE  force_matching_signature IN (SELECT force_matching_signature 
                                                FROM c)
          )
 SELECT sq.sql_id,
        sq.force_matching_signature,
        sq.sql_text,
        c.cnt "unshared count"
 FROM   c,
         sq
 WHERE  sq.force_matching_signature = c.force_matching_signature
         AND sq.p = 1
         AND c.cnt > 500
 ORDER  BY c.cnt DESC;

 
 SQL_ID                 FORCE_MATCHING_SIGNATURE SQL_TEXT                                 unshared count
 ------------- --------------------------------- ---------------------------------------- --------------
 gzz41qhxxxxxx              16950911907815000000 update TABLE_NAME set COL1=104094 where           13053 
                                                 SDATE='21-Jul-2022' and COL2='XXX'

 gyp69g4xxxxxx               3416587319733000000 update TABLE_NAME set COLUMN1=28373               12627
                                                 0026 where COLUMN2='XXXXXX' and SDATE='
                                                 19-Jul-2022'

We noticed similar queries in the automatic diagnostic trace file (DBNAME_dia0_21759812_base_5.trc).
This discovery provides valuable insights into why the “shared pool” was exhibiting dynamic growth.

However, the application team pointed out that this parse-intensive behaviour had been a consistent characteristic of the application even on the 12c version.

The question still lingers: Is there any connection to the 19c upgrade?

connection management call elapsed time

This was the highest consumer in the “Time Model Statistics”. Although the rate of logons per second remained broadly the same as before, the time spent here on the 12c version had been significantly lower.

Drawing on a prior encounter during another 19c upgrade, where we had seen a comparable scenario involving substantial DML against the ‘USER$’ table, we set out to establish whether the same was happening in the current database.

In the section titled “SQL ordered by Elapsed Time”, a notable surge in execution was apparent for SQL-ID 9zg9qd9bm4spu.

 
 update user$ set spare6=DECODE(to_char(:2,  'YYYY-MM-DD'),  '0000-00-00',  to_date(NULL),  :2) 
 where user#=:1

This query had not been seen in the 12c database.
The “Segments by DB Blocks Changes” section of the AWR report showed the highest DB block changes against the USER$ table.
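To quantify how sharply executions of this statement rose across snapshots, a sketch such as the following against DBA_HIST_SQLSTAT can be used (the delta columns are per-snapshot increments; the SQL_ID is the one reported above).

 -- Per-snapshot execution and elapsed-time deltas for the USER$ update
 select s.snap_id,
        to_char(s.end_interval_time, 'DD-MON HH24:MI') as snap_end,
        q.executions_delta,
        round(q.elapsed_time_delta/1e6, 1)             as elapsed_sec
 from   dba_hist_sqlstat  q
 join   dba_hist_snapshot s
        on  s.snap_id = q.snap_id
        and s.instance_number = q.instance_number
 where  q.sql_id = '9zg9qd9bm4spu'
 order  by s.snap_id;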

So, what led to this?

The statistics presented above led us to identify the following Oracle Bug.
Bug 33121934 – Library cache lock / load lock / mutex x during connection storm due to update user$ (Doc ID 33121934.8)

Concurrent user logins can be disrupted by the Last Successful Logon Time (LSLT) update: if concurrent logons are delayed while waiting for the LSLT update during periods of high system load or a logon storm, it is likely that this bug has resurfaced.

During a connection storm you may see wait events like:

  • library cache: mutex X
  • library cache load lock
  • library cache lock
  • gc buffer busy acquire

All three threads of the diagnosis pointed to the same conclusion: memory allocation relied exclusively on the shared pool, which came under severe pressure from the high volume of logon activity and the extensive hard parsing.
The result was contention on “latch: shared pool”, “library cache: mutex X” and “library cache load lock”.

Solution Implemented

Our database was on patch level 19.15. We applied the patch for the following bug to address the issue on the 19c release.
Bug 33121934 Library cache lock / load lock / mutex x during connection storm due to update user$

To reduce the frequency of this update statement, we configured the following parameter so that the LSLT is updated at most once every 30 seconds: _granularity_last_successful_login_time=30.

This controls the granularity of the LSLT update for _ALL_ users: if it is set to 30 seconds, then further logons by a user within that 30-second window will NOT update the LSLT for that user.
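For reference, an underscore parameter of this kind is typically set as sketched below; we changed it only under Oracle Support's guidance, and whether it can be applied dynamically or needs a restart in your release should follow their advice rather than this sketch.

 -- Set the LSLT update granularity to 30 seconds (underscore parameters
 -- should only be changed when advised by Oracle Support)
 alter system set "_granularity_last_successful_login_time" = 30
   scope = spfile sid = '*';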

We then decided to eliminate the remaining pressure on the shared pool latches.

We requested the application team to rewrite these hard-parsed SQL statements to use bind variables.
Recognising the challenge of modifying every query that relies on literals, we opted to take a calculated risk and configured cursor_sharing=FORCE.
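The cursor_sharing change itself is a one-liner (it is a dynamic parameter); with FORCE in effect, the literal-only variants identified earlier collapse onto a single shared cursor per FORCE_MATCHING_SIGNATURE.

 -- Replace literals with system-generated binds across the cluster
 alter system set cursor_sharing = FORCE scope = both sid = '*';

 -- Confirm the new setting
 show parameter cursor_sharing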

Here are the results following all the modifications:

The number of Hard Parses (SQL) per second has been nearly eliminated.

There was a significant decrease in “connection management call elapsed time” and “parse time elapsed”.
Previously, the figures were 142,363 and 31,244, respectively.

The free memory in the shared pool approached nearly 10 GB, with no further growth from dynamic allocations.
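The free-memory figure can be tracked with a simple query against V$SGASTAT:

 -- Current free memory in the shared pool, in GB
 select round(bytes/1024/1024/1024, 2) as free_gb
 from   v$sgastat
 where  pool = 'shared pool'
 and    name = 'free memory';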

All issues have been successfully resolved, and the application is now running flawlessly!
