db file sequential read

Problem Statement

The support team reported a performance issue affecting one of the critical applications. According to their update, end users were experiencing slowness, which in turn disrupted related workflows.

Diagnosis

Our investigation began with an assessment of database performance. Reviewing the database holistically through the OEM monitoring screens, we noticed a significant increase in “db file sequential read” wait events affecting more than 100 active sessions. The graph below illustrates the breakdown of active sessions across multiple database events during the issue period.
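
For reference, what OEM shows on the Performance page can be cross-checked by counting active sessions per wait event directly from the dictionary. The query below is a generic illustration, not the exact one used during the incident.

 select event, wait_class, count(*) active_sessions
 from   gv$session
 where  status = 'ACTIVE'
 and    wait_class <> 'Idle'
 group  by event, wait_class
 order  by active_sessions desc;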

To analyze further, we retrieved the AWR report for the impacted timeframe. The top 10 wait events observed during this period are detailed below.
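
(For completeness, the AWR report can be generated from SQL*Plus with the standard script shipped with the database; it prompts for the report format and the begin/end snapshot IDs.)

 -- standard AWR report script
 @?/rdbms/admin/awrrpt.sql
 -- prompts for report type (html/text), number of days and begin/end snap ids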

Over 52% of the database time was attributed to the “db file sequential read” event. While the number of waits was substantial, the average wait time per request was relatively low at 472 microseconds. Delving deeper into the “Wait Classes by Total Wait Time” section, we observed that more than 80% of the total time was consumed by the “User I/O” wait class, as shown in the details below.
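
The same wait-class picture can be sanity-checked outside AWR with a query against v$system_wait_class (cumulative since instance startup); shown here purely as an illustration.

 select wait_class,
        round(time_waited/100)                              time_waited_s,
        round(100*ratio_to_report(time_waited) over (), 1)  pct_of_total
 from   v$system_wait_class
 where  wait_class <> 'Idle'
 order  by time_waited desc;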

Examining the “Host CPU” section revealed elevated “%System” and “%WIO” values; compared to “%User” utilization, both were disproportionately high.
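
The OS-level figures that AWR summarizes under “Host CPU” come from v$osstat; a quick check looks like the query below (statistic availability, notably IOWAIT_TIME, varies by platform).

 select stat_name, value
 from   v$osstat
 where  stat_name in ('BUSY_TIME','USER_TIME','SYS_TIME','IOWAIT_TIME','IDLE_TIME');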

Based on our prior experience, we promptly engaged the system team to verify if any OS agents were contributing to the “%System” utilization. However, no such agents were running or consuming CPU resources.

We then focused on investigating the “%WIO” metric.
Out of the total 19TB of I/O, direct reads accounted for 11TB, which likely contributed to the increased physical reads and the high “%WIO” on the server. Despite this, the average wait time per read was only 678 microseconds.
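
Direct-path reads bypass the buffer cache, so they show up as physical I/O regardless of cache size. The split can be confirmed from the cumulative counters in v$sysstat (illustrative check only):

 select name, value
 from   v$sysstat
 where  name in ('physical reads',
                 'physical reads direct',
                 'physical reads direct temporary tablespace');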

To identify the root cause, we compared the current state with a previous good period when the application was performing well. The application team confirmed that performance was stable during the same time a day earlier. Consequently, we decided to generate and analyze an AWR comparison report for the good and bad periods.
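
(The comparison report is produced with the standard AWR “Compare Periods” script; the snapshot IDs it asks for are supplied interactively.)

 -- AWR Compare Periods (diff) report
 @?/rdbms/admin/awrddrpt.sql
 -- prompts for the begin/end snapshots of both the good and the bad period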

Based on our earlier understanding of “Wait Classes,” we reviewed the relevant section of the AWR Compare report.

We observed a significant 400x increase in “Total Wait Time (sec)” for the “User I/O” wait class during the issue period compared to the good period.

Additionally, for all top I/O-related events, both “# Waits/sec (Elapsed Time)” and “Total Wait Time (sec)” were considerably higher during the bad period.

Typically, with such an outcome, we examine SQLs for plan changes that may lead to higher I/O. Accordingly, we reviewed the “Top SQL Comparison by Elapsed Time” section. For most of the queries, no plan changes were detected; however, “Elapsed Time (ms) per Exec” was noticeably higher during the issue period.

In the “Top SQL Comparison by Physical Reads” section, we identified that most SQL queries experienced a substantial increase in “Physical Reads per Exec”, even though no SQL plan changes were observed.
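
Plan stability and per-execution I/O can also be verified straight from dba_hist_sqlstat; the query below is only a sketch of the check the AWR comparison performs, and the snapshot range is a placeholder.

 select   sql_id, plan_hash_value,
          sum(executions_delta)                                                 execs,
          round(sum(elapsed_time_delta)/1000/nullif(sum(executions_delta),0),1) ela_ms_per_exec,
          round(sum(disk_reads_delta)/nullif(sum(executions_delta),0))          phys_reads_per_exec
 from     dba_hist_sqlstat
 where    snap_id between :begin_snap and :end_snap   -- placeholder snapshot range
 group by sql_id, plan_hash_value
 order by sum(elapsed_time_delta) desc
 fetch first 20 rows only;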

Now… what next?

Given the increased physical reads without any SQL plan change, we revisited the AWR Compare report to assess database memory allocation. A closer look revealed insights that gave the investigation a clear direction.

In the “Cache Sizes” section, we observed a significant shift: a 12GB decrease in “Buffer Cache” and a corresponding 12GB increase in “Shared Pool”.
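
The current split, and how often each component has been resized, can be confirmed from v$sga_dynamic_components:

 select component,
        round(current_size/1024/1024) current_mb,
        round(min_size/1024/1024)     min_mb,
        round(max_size/1024/1024)     max_mb,
        oper_count, last_oper_type, last_oper_time
 from   v$sga_dynamic_components
 where  component in ('shared pool','DEFAULT buffer cache');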

This observation provided insights into the root cause of the high physical reads and increased server “%WIO”. The drastic reduction in the “Buffer Cache”, which was left with only 1GB of free space, caused most queries to rely on physical reads, leading to a spike in “%WIO” on the system.

The “Load Profile” section corroborated this finding, as metrics like “Physical read (blocks):”, “Read IO requests:”, and “Read IO (MB):” had increased by approximately 200 times.

This raised a critical question: why did the “Shared Pool” request additional memory, leading to this performance degradation?

To answer this, we turned to the “SGA Breakdown Difference” section to determine which components of the “Shared Pool” were responsible for the increase.

The analysis revealed substantial growth in the “PX msg pool”, which expanded from 800MB to 16GB, resulting in significant dynamic memory resizing. This pool is used by parallel execution processes, yet no SQLs appeared to be executing in parallel during the reported issue period.
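
The current size of this allocation can also be read directly from v$sgastat, for example:

 select pool, name, round(bytes/1024/1024) size_mb
 from   v$sgastat
 where  name = 'PX msg pool';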

Next, we reviewed the “init.ora Parameters” section for any changes related to parallel execution.
However, we did not observe any changes. Parameters like parallel_min_servers (192) and parallel_max_servers (960) remained unchanged during the issue period.
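
An equivalent check can be run against v$parameter (the parameter list below is illustrative):

 select name, value, isdefault, ismodified
 from   v$parameter
 where  name in ('parallel_min_servers','parallel_max_servers',
                 'parallel_execution_message_size','parallel_force_local');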

This left us puzzled: why did the “PX msg pool” grow so dramatically in the absence of any parameter changes or parallel SQL executions?
Alright then… what’s the next step?

We investigated the timing of the SGA resize operations using the query below.

 
 select COMPONENT,OPER_TYPE,PARAMETER,INITIAL_SIZE/1024/1024 INITIAL_SIZE,
        TARGET_SIZE/1024/1024 TARGET_SIZE,FINAL_SIZE/1024/1024 FINAL_SIZE, 
        STATUS,START_TIME
 from v$sga_resize_ops
 order by 8;

 COMPONENT                      OPER_TYPE     PARAMETER                 INITIAL_SIZE TARGET_SIZE FINAL_SIZE STATUS    START_TIME
 ------------------------------ ------------- ------------------------- ------------ ----------- ---------- --------- --------------------
 DEFAULT buffer cache           SHRINK        db_cache_size                    14208       14144      14144 COMPLETE  21-nov-2024 21:27:31
 DEFAULT buffer cache           SHRINK        db_cache_size                    14144       14080      14080 COMPLETE  21-nov-2024 21:27:31
 .
 .
 shared pool                    GROW          shared_pool_size                 10240       10304      10304 COMPLETE  21-nov-2024 21:27:31
 shared pool                    GROW          shared_pool_size                 10304       10368      10368 COMPLETE  21-nov-2024 21:27:31
 .
 .
 DEFAULT buffer cache           SHRINK        db_cache_size                    13568       13504      13504 COMPLETE  21-nov-2024 21:27:32
 DEFAULT buffer cache           SHRINK        db_cache_size                    13568       13504      13504 COMPLETE  21-nov-2024 21:27:32
 .
 .
 shared pool                    GROW          shared_pool_size                 10880       10944      10944 COMPLETE  21-nov-2024 21:27:32
 shared pool                    GROW          shared_pool_size                 10880       10944      10944 COMPLETE  21-nov-2024 21:27:32
 .
 .
 DEFAULT buffer cache           SHRINK        db_cache_size                     2944        2880       2880 COMPLETE  21-nov-2024 21:27:52
 DEFAULT buffer cache           SHRINK        db_cache_size                     2880        2816       2816 COMPLETE  21-nov-2024 21:27:52
 .
 .
 shared pool                    GROW          shared_pool_size                 21440       21504      21504 COMPLETE  21-nov-2024 21:27:52
 shared pool                    GROW          shared_pool_size                 21504       21568      21568 COMPLETE  21-nov-2024 21:27:52
 .
 .
 DEFAULT buffer cache           SHRINK        db_cache_size                     1792        1728       1728 COMPLETE  21-nov-2024 21:27:55
 DEFAULT buffer cache           SHRINK        db_cache_size                     1728        1664       1728 COMPLETE  21-nov-2024 21:27:57
 shared pool                    GROW          shared_pool_size                 22720       22784      22720 COMPLETE  21-nov-2024 21:27:57
 shared pool                    GROW          shared_pool_size                 22720       22784      22720 COMPLETE  21-nov-2024 21:27:57
 DEFAULT buffer cache           SHRINK        db_cache_size                     1728        1664       1728 COMPLETE  21-nov-2024 21:27:57
 .
 .
 shared pool                    GROW          shared_pool_size                 22848       22912      22912 ERROR     21-nov-2024 21:28:02 
 shared pool                    GROW          shared_pool_size                 22848       22912      22912 ERROR     21-nov-2024 21:28:02
 shared pool                    GROW          shared_pool_size                 22848       22912      22912 ERROR     21-nov-2024 21:28:02
 shared pool                    GROW          shared_pool_size                 22848       22912      22912 ERROR     21-nov-2024 21:28:02
 DEFAULT buffer cache           SHRINK        db_cache_size                     1728        1664       1664 COMPLETE  21-nov-2024 21:28:02
 DEFAULT buffer cache           SHRINK        db_cache_size                     1664        1600       1600 COMPLETE  21-nov-2024 21:28:02
 DEFAULT buffer cache           SHRINK        db_cache_size                     1600        1536       1536 COMPLETE  21-nov-2024 21:28:02
 DEFAULT buffer cache           SHRINK        db_cache_size                     1600        1536       1536 ERROR     21-nov-2024 21:28:02
 DEFAULT buffer cache           SHRINK        db_cache_size                     1600        1536       1536 ERROR     21-nov-2024 21:28:02
 DEFAULT buffer cache           SHRINK        db_cache_size                     1600        1536       1536 ERROR     21-nov-2024 21:28:02

Based on the data, most of the resize operations occurred between 21-Nov-2024 21:27:45 and 21-Nov-2024 21:28:02. However, a few of the most recent resize operations ended in an ‘ERROR’ state.

To further analyze the issue, we reviewed the database alert log for the same timeframe.

 
 2024-11-21T21:16:01.056066+05:30
 ARC2 (PID:53150282): Archived Log entry 1509998 added for T-1.S-280141 ID 0x22aa6a99 LAD:1
 2024-11-21T21:28:02.165503+05:30
 Errors in file /ora19c/app/diag/rdbms/DEMO/DEMO1/trace/DEMO1_p00g_16712562.trc (incident=2201426):
 ORA-04031: unable to allocate 32792 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","PX msg pool")
 2024-11-21T21:28:02.167306+05:30
 Errors in file /ora19c/app/diag/rdbms/DEMO/DEMO1/trace/DEMO1_p00c_46400308.trc (incident=2201386):
 ORA-04031: unable to allocate 32792 bytes of shared memory ("shared pool","unknown object","sga heap(2,0)","PX msg pool") 
 Incident details in: /ora19c/app/diag/rdbms/DEMO/DEMO1/incident/incdir_2201426/DEMO1_p00g_16712562_i2201426.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Incident details in: /ora19c/app/diag/rdbms/DEMO/DEMO1/incident/incdir_2201386/DEMO1_p00c_46400308_i2201386.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 2024-11-21T21:28:03.054828+05:30
 Errors in file /ora19c/app/diag/rdbms/DEMO/DEMO1/trace/DEMO1_p08h_58655090.trc (incident=2210650):
 ORA-04031: unable to allocate 32792 bytes of shared memory ("shared pool","unknown object","sga heap(6,0)","PX msg pool")
 Incident details in: /ora19c/app/diag/rdbms/DEMO/DEMO1/incident/incdir_2210650/DEMO1_p08h_58655090_i2210650.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 2024-11-21T21:28:03.190483+05:30
 Errors in file /ora19c/app/diag/rdbms/DEMO/DEMO1/trace/DEMO1_p09j_24511200.trc (incident=2210954):

During the resize operations, we identified ORA-04031 errors. These indicate a lack of available space in the shared pool; the resize operations failed because there was no free memory left in the SGA.
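
At that point, the remaining free memory in the shared pool can be checked with a query such as:

 select pool, round(bytes/1024/1024) free_mb
 from   v$sgastat
 where  name = 'free memory'
 and    pool = 'shared pool';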

So, what led to this?

We then examined the ORA-04031 trace file.
In the trace file, we found the SQL execution below, which triggered this massive memory resize and the ORA-04031 failure.

 
 LibraryHandle:  Address=700010048fbfb00 Hash=8346a122 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
   ObjectName:  Name=select /*+ PARALLEL(5000) */ to_char(a.COL1,:"SYS_B_0") COL2,COUNT(DISTINCT a.COL3) COL3,
 COUNT(DISTINCT a.COL4) COL5, COUNT(*)
 from  SCHEMA.TABLE1 a
 INNER JOIN SCHEMA.TABLE2 b
 ON a.COL2 = b.COL2
 where to_char(a.COL1,:"SYS_B_1") >=:"SYS_B_2" and to_char(a.COL1,:"SYS_B_3")<=:"SYS_B_4"
 group by to_char(a.COL1
     FullHashValue=4b87fbc472ca0b1c6e4ec1668346a122 Namespace=SQL AREA(00) SqlId=6wmq1cu1nd892 Type=CURSOR(00)
   Statistics:  InvalidationCount=0 ExecutionCount=1 LoadCount=2 ActiveLocks=961 TotalLockCount=961 TotalPinCount=1
   Counters:  BrokenCount=1 RevocablePointer=1 KeepDependency=1 Version=0 BucketInUse=961 HandleInUse=961 HandleReferenceCount=0 
   Concurrency:  DependencyMutex=700010048fbfbb0(0, 2, 0, 0) Mutex=700010048fbfc50(0, 1947, 840, 0)
   Flags=RON/PIN/TIM/PN0/DBN/[10012841] Flags2=[0000]
   WaitersLists:
     Lock=700010048fbfb90[700010048fbfb90,700010048fbfb90]
     Pin=700010048fbfb70[700010048fbfb70,700010048fbfb70]
     LoadLock=700010048fbfbe8[700010048fbfbe8,700010048fbfbe8]
   Timestamp:  Current=11-21-2024 21:27:23 

The SQL was executed with a /*+ PARALLEL(5000) */ hint and, with parallel_max_servers=960, consumed all available parallel execution servers. The Oracle engine then attempted to grow the “PX msg pool” to accommodate this execution.
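
The number of PX servers actually allocated at any moment can be watched with v$px_process and v$pq_sysstat (shown only as a generic check):

 select status, count(*) px_servers
 from   v$px_process
 group  by status;

 select statistic, value
 from   v$pq_sysstat
 where  statistic in ('Servers Busy','Servers Idle','Servers Highwater');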

Eventually, the SGA was exhausted and the execution failed with ORA-04031.

Solution Implemented

We recommended avoiding the execution of such heavily parallel statements on the production system; instead, such operations should be directed to the standby or DR system.

Additionally, we advised modifying certain critical parameters to prevent similar resize operations in the future.
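
One way to achieve this under ASMM (the sizes below are purely illustrative and must be tuned per system) is to set explicit values for the buffer cache and shared pool, since a user-specified size acts as a floor when sga_target is set and prevents the buffer cache from being shrunk this aggressively:

 -- Hypothetical values: under ASMM these act as minimum sizes, not fixed sizes.
 alter system set db_cache_size    = 12G scope=both sid='*';
 alter system set shared_pool_size = 4G  scope=both sid='*';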

Lastly, we suggested removing all double-underscored parameters and re-creating the pfile/spfile with the updated parameter values. Other parameters should retain their standard, pre-existing values.
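
A sketch of that clean-up (file paths are placeholders):

 -- dump the current spfile to an editable pfile
 create pfile='/tmp/initDEMO1_clean.ora' from spfile;
 -- manually remove the double-underscore (__) auto-tuned entries from the pfile,
 -- keep all other parameters as they are, then rebuild the spfile:
 create spfile from pfile='/tmp/initDEMO1_clean.ora';
 -- the instance must be restarted to pick up the new spfile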

We successfully resolved the issue, and the application now runs smoothly without any problems!

alliswell
