Problem Statement
The application team reported a recurring transaction timeout occurring every Thursday at 9:00 PM. The issue persists for approximately 5 minutes and typically resolves itself automatically. However, while it lasts, the application is unable to establish new connections to the database.
Diagnosis
We started our investigation from the problem statement and promptly examined the database alert log around 9 PM. Within the alert log, we observed the following error messages.
2023-10-12T21:02:23.071996+05:30
Unified Audit record write to audit trail table failed due to ORA-18. Writing the record to OS spillover file.
2023-10-12T21:02:23.079415+05:30
Unified Audit record write to audit trail table failed due to ORA-18. Writing the record to OS spillover file.
.
.
.
.
Errors in file /diagdumps/DBNAME/diag/rdbms/DBNAME_prod/DBNAME1/trace/DBNAME1_cl04_41682556.trc:
ORA-00018: maximum number of sessions exceeded
2023-10-12T21:03:43.826534+05:30
Errors in file /diagdumps/DBNAME/diag/rdbms/DBNAME_prod/DBNAME1/trace/DBNAME1_cl04_41682556.trc:
ORA-00018: maximum number of sessions exceeded
2023-10-12T21:03:43.827711+05:30
Errors in file /diagdumps/DBNAME/diag/rdbms/DBNAME_prod/DBNAME1/trace/DBNAME1_cl04_41682556.trc:
ORA-00018: maximum number of sessions exceeded
These error messages were a clear indicator that the database had exhausted all of its allocated processes (6000), preventing the application from establishing new connections.
Typically, the application maintains 3000-4000 connections, with only 50-100 sessions active at any given moment. A review of the historical process utilisation showed that the application was consistently establishing new connections at a rapid pace during the issue timeframe, and this burst exhausted the allotted database processes.
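For reference, the kind of check we run in such situations is sketched below. It assumes SYSDBA access on the database server; the views are standard, but the breakdown by machine/program is purely illustrative.

sqlplus -s / as sysdba <<'EOF'
set lines 200 pages 100
-- how close the instance is to its configured limits
select resource_name, current_utilization, max_utilization, limit_value
from   v$resource_limit
where  resource_name in ('processes', 'sessions');

-- which machine/program is holding most of the connections
select machine, program, count(*) as sessions
from   gv$session
group  by machine, program
order  by sessions desc
fetch first 15 rows only;
EOF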
To assess the connection pattern more thoroughly, we extracted data from the listener.log file.
cat listener.log | awk '{print $1" "$2}' | grep "12-OCT-2023 21:" | uniq -c

Connections Per Second    Time
----------------------    --------------------
  1    12-OCT-2023 21:00:06
  1    12-OCT-2023 21:00:10
  1    12-OCT-2023 21:00:11
  2    12-OCT-2023 21:00:15
  2    12-OCT-2023 21:00:47
  1    12-OCT-2023 21:00:51
 32    12-OCT-2023 21:00:58
 60    12-OCT-2023 21:00:59
 59    12-OCT-2023 21:01:00
 59    12-OCT-2023 21:01:01
 55    12-OCT-2023 21:01:02
 62    12-OCT-2023 21:01:03
 57    12-OCT-2023 21:01:04
 59    12-OCT-2023 21:01:05
 57    12-OCT-2023 21:01:06
 60    12-OCT-2023 21:01:07
 58    12-OCT-2023 21:01:08
 52    12-OCT-2023 21:01:09
 57    12-OCT-2023 21:01:10
 56    12-OCT-2023 21:01:11
 60    12-OCT-2023 21:01:12
 56    12-OCT-2023 21:01:13
 60    12-OCT-2023 21:01:14
 44    12-OCT-2023 21:01:15
 59    12-OCT-2023 21:01:16
 55    12-OCT-2023 21:01:17
 58    12-OCT-2023 21:01:18
 57    12-OCT-2023 21:01:19
 52    12-OCT-2023 21:01:20
 45    12-OCT-2023 21:01:21
 55    12-OCT-2023 21:01:22
 55    12-OCT-2023 21:01:23
 54    12-OCT-2023 21:01:24
 58    12-OCT-2023 21:01:25
 38    12-OCT-2023 21:01:26
 55    12-OCT-2023 21:01:27
 57    12-OCT-2023 21:01:28
Under normal circumstances, the application establishes only a single-digit number of connections per second.
However, the data above shows a significantly elevated connection rate, which led to the “ORA-00018: maximum number of sessions exceeded” error.
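To see where such a burst is coming from, the same listener.log can also be grouped by connecting host. A rough sketch is below; the grep/sed extraction is illustrative, since field positions vary with the listener.log format. This helps confirm whether the burst originates from the known application hosts or from somewhere unexpected.

# count the 21:00-21:01 connection requests per client host
grep "12-OCT-2023 21:0" listener.log | grep "establish" \
  | sed 's/.*HOST=\([^)]*\).*/\1/' \
  | sort | uniq -c | sort -rn | head -20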
We then analysed the database wait events that were keeping the majority of active sessions in a waiting state.
The database experienced significant waits on “enq: US – contention” and “row cache lock”. AWR reports also revealed a noticeable increase in “rollbacks per second” compared to normal periods.
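The wait-event breakdown itself can be pulled from ASH; a minimal sketch, assuming a Diagnostics Pack licence and the 12-Oct issue window (for older windows the same query would go against dba_hist_active_sess_history):

sqlplus -s / as sysdba <<'EOF'
set lines 200 pages 100
-- top wait events during the issue window
select nvl(event, 'ON CPU') as event, count(*) as samples
from   gv$active_session_history
where  sample_time between timestamp '2023-10-12 21:00:00'
                       and timestamp '2023-10-12 21:10:00'
group  by event
order  by samples desc
fetch first 10 rows only;
EOF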
We asked the application team whether any jobs or services were initiated during this period that could explain the increased connection rate. They confirmed that none were.
We also suspected that the application might be abruptly terminating its connections, forcing in-flight transactions to roll back.
To gain a comprehensive understanding of their processes, we conducted an in-depth call with the application team.
During the call, we discovered that the application was encountering a significant number of “ORA-00001: unique constraint violated” errors during regular insert operations at the front end. Typically, this arises when the application exceeds its defined timeout thresholds and the pending INSERT operations are resubmitted to the database, resulting in unique constraint violations.
As the application experienced timeouts, connections were abruptly terminated from the front end. This led to the rollback of in-flight DML operations and, consequently, contention on UNDO.
Further investigation revealed that the application retries insert operations whenever it does not receive a timely response; this behaviour is inherent to the application. Because of these repeated retries, the application kept creating new connections to the database and exhausted the available processes.
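While the behaviour is occurring, the rollback pressure can also be observed directly from the database; a minimal sketch (the views are standard, the event filters simply mirror the waits seen in AWR):

sqlplus -s / as sysdba <<'EOF'
set lines 200 pages 100
-- active transactions and the undo they currently hold
select status, count(*) as txns,
       sum(used_ublk) as undo_blocks, sum(used_urec) as undo_records
from   v$transaction
group  by status;

-- sessions waiting on the undo / row cache contention observed in AWR
select event, count(*) as waiters
from   gv$session
where  wait_class <> 'Idle'
and    (event like 'enq: US%' or event like 'row cache%')
group  by event;
EOF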
We now understood the reason behind the “ORA-00001” errors and the high UNDO contention.
However, we still had questions about the timeouts the application faced.
This prompted us to again investigate within the database.
We revisited the database alert log on a different Thursday and discovered a handful of interesting errors, as listed below.
2023-10-19T21:01:03.133513+05:30
KQR row cache hit _kqr_max_hot_copies limit 16 cid 0 bucket 3
2023-10-19T21:01:54.729526+05:30
LGWR (PID:6620070): ORA-16198: Received timed out error from KSR
LGWR (PID:6620070): Attempting LAD:3 network reconnect (16198)
LGWR (PID:6620070): LAD:3 network reconnect abandoned
2023-10-19T21:01:54.741196+05:30
Errors in file /diagdumps/DBNAME/diag/rdbms/DBNAME_prod/DBNAME1/trace/DBNAME1_lgwr_6620070.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
LGWR (PID:6620070): Error 16198 for LNO:2 to 'DBNAMEFL'
2023-10-19T21:01:54.761111+05:30
LGWR (PID:6620070): LAD:3 is UNSYNCHRONIZED
LGWR (PID:6620070): Failed to archive LNO:2 T-1.S-162807, error=16198
.
.
.
.
2023-10-19T21:02:26.418149+05:30
ORA-00020: maximum number of processes (6000) exceeded
ORA-20 errors will not be written to the alert log for the next minute.
Please look at trace files to see all the ORA-20 errors.
The above errors strongly indicated that the database was running with a Maximum Availability configuration for redo transport, and that it was encountering timeouts while transmitting redo to the remote instance in SYNC mode. Our immediate response was to review the LGWR trace files for both Thursdays, and there we noticed significant delays during the log write process.
*** 2023-10-12T21:01:04.831697+05:30
Warning: log write elapsed time 1357ms, size 1KB
*** 2023-10-12T21:01:06.575859+05:30
Warning: log write elapsed time 1370ms, size 22KB
.
.
.
.
*** 2023-10-12T21:01:45.422001+05:30
Warning: log write elapsed time 2388ms, size 2KB
*** 2023-10-12T21:01:48.012110+05:30
Warning: log write elapsed time 2462ms, size 0KB
*** 2023-10-19T21:00:36.590912+05:30
Warning: log write elapsed time 1089ms, size 3KB
*** 2023-10-19T21:00:38.040331+05:30
Warning: log write elapsed time 1294ms, size 3KB
.
.
.
.
*** 2023-10-19T21:01:11.428603+05:30
Warning: log write elapsed time 3769ms, size 14KB
*** 2023-10-19T21:01:14.862758+05:30
Warning: log write elapsed time 3245ms, size 5KB
*** 2023-10-19 21:01:54.726835 [krsw.c:3637]
krsw_ksr_receive: Current time <10/19/2023 21:01:49> exceeds begin time <10/19/2023 21:01:49> by <5> seconds, return timed out error
*** 2023-10-19 21:01:54.729546 [krsh.c:6397]
ORA-16198: Received timed out error from KSR
*** 2023-10-19 21:01:54.734251 [krsh.c:6397]
Attempting LAD:3 network reconnect (16198)
*** 2023-10-19 21:01:54.737572 [krsh.c:6397]
LAD:3 network reconnect abandoned
krsw_ksr_status: Received error 16198 on receiving channel messages
at 0xfffffffffff6e30 placed krsl.c@7125
ORA-16198: Timeout incurred on internal channel during remote archival
at 0xfffffffffffb548 placed ksb.c@4185
*** 2023-10-19 21:01:54.747709 [krsh.c:6397]
Error 16198 for LNO:2 to 'DBNAMEFL'
*** 2023-10-19 21:01:54.747742 [krsi.c:8750]
krsi_dst_fail_caller: LAD:3 err:16198 force:0 blast:1
During regular hours, the “log write elapsed time” message occurred infrequently, but every Thursday at 9 PM, many such messages appeared.
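To quantify that difference, the warnings can simply be counted per minute from the LGWR trace file; a sketch (the exact trace file name comes from the alert log above and will change after an instance restart):

# count "log write elapsed time" warnings per minute on 12-Oct
grep -h -B1 "Warning: log write elapsed time" DBNAME1_lgwr_*.trc \
  | grep "^\*\*\* 2023-10-12" \
  | cut -c5-20 \
  | uniq -c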
To address this issue, the most straightforward approach was to deactivate the “Maximum Availability” mode.
However, this solution could potentially compromise the database’s ZERO DATA LOSS architecture.
We therefore chose not to disable Maximum Availability mode and continued our diagnosis to identify the root cause of the problem.
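For context, the state of the destination LGWR was timing out on can be checked as below; LAD:3 maps to LOG_ARCHIVE_DEST_3 as per the alert log, and the rest is a minimal sketch:

sqlplus -s / as sysdba <<'EOF'
set lines 200 pages 100
show parameter log_archive_dest_3
-- state of the SYNC destination pointing to DBNAMEFL
select dest_id, destination, status, transmit_mode, affirm, error
from   v$archive_dest
where  dest_id = 3;
EOF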
We checked the database server for cron jobs scheduled during the issue timeframe, but none were found.
The application team had already confirmed that they did not have any scheduled tasks that could be causing this issue.
Our next step was to diagnose the root cause of the increased “log write elapsed time”.
We extracted I/O statistics for the local server from OSWatcher, and the service times remained within acceptable limits.
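A sketch of how those statistics can be pulled from the OSWatcher archive is below; the archive location and file naming are assumptions, as they depend on how OSWatcher is configured on the host:

# iostat samples captured by OSWatcher around 21:00 on 12-Oct (assumed archive path and file pattern)
cd /oswatcher/archive/oswiostat
zgrep -h "21:0" *iostat_23.10.12*.dat* 2>/dev/null | head -50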
Typically, when encountering log write delays related to a REMOTE instance, our initial instinct is to blame network delays.
However, it raises a significant question: If the network were indeed the culprit, why does it consistently manifest itself every Thursday at 9 PM?
We decided to explore the remote server, identified from the TNS entry DBNAMEFL.
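The host behind that service is easy to confirm from the TNS configuration; a small sketch:

# resolve the remote host behind the Data Guard destination
tnsping DBNAMEFL
# or read it straight from tnsnames.ora (adjust the path if TNS_ADMIN points elsewhere)
grep -i -A5 "DBNAMEFL" $ORACLE_HOME/network/admin/tnsnames.ora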
On the remote server, OSWatcher was not running, the database was operating as a standby, and we had no historical AWR or ASH reports for the issue window.
We examined the remote database’s alert log, but it did not show any errors within the issue timeframe.
Now… what next?
As the remote instance was running on an AIX server, we retrieved NMON data and created graphs using NMONVisualizer.
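For reference, NMON data on AIX is typically captured with something along these lines (interval, snapshot count and output directory are illustrative); the resulting .nmon files are what get loaded into NMONVisualizer:

# one day of samples: spreadsheet output (-f), every 60s (-s), 1440 snapshots (-c), written to /nmon/data (-m)
nmon -f -s 60 -c 1440 -m /nmon/data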
The insights obtained from the graphs were truly captivating.
(Note: The graphs represent UTC+5 Hrs. Please subtract 30 mins while reading the below graphs.)
The above graphs show a significant spike in the I/O rate on the remote instance during the problematic time period, with the kernel consistently driving CPU utilisation to its maximum. Consequently, we reached out to the local DBA/system team to investigate the root cause of this elevated I/O rate.
So, what led to this?
A full database backup was scheduled on the remote instance every Thursday at 9 PM.
-----------------------------------------------------
            ORACLE RMAN BACKUP STATUS
-----------------------------------------------------
Netbackup Server : bkphostname
Netbackup Client : dbhostname
Database Name    : dbname2
Policy Name      : dbhostname-RMAN-dbname2-FULL
Back Type        : INCREMENTAL LEVEL=0
RMAN TAG Name    : TAG dbname2_full_hot_5_Oct_2023
Start Date       : 5 Oct 2023 21:00:01
End Date         : 6 Oct 2023 00:53:25
Status           : ended successfully
-----------------------------------------------------

-----------------------------------------------------
            ORACLE RMAN BACKUP STATUS
-----------------------------------------------------
Netbackup Server : bkphostname
Netbackup Client : dbhostname
Database Name    : dbname2
Policy Name      : dbhostname-RMAN-dbname2-FULL
Back Type        : INCREMENTAL LEVEL=0
RMAN TAG Name    : TAG dbname2_full_hot_12_Oct_2023
Start Date       : 12 Oct 2023 21:00:01
End Date         : 13 Oct 2023 00:58:15
Status           : ended successfully
-----------------------------------------------------
During its initialisation phase, the backup consumed excessive resources on the remote instance, which in turn delayed the remote (SYNC) redo write operations issued by the production database.
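The overlap can also be cross-checked from the standby’s own backup history; a minimal sketch using a standard RMAN view (the timings in the report above come from NetBackup):

sqlplus -s / as sysdba <<'EOF'
set lines 200 pages 100
-- most recent backup jobs recorded in the control file
select start_time, end_time, status, input_type,
       round(elapsed_seconds/60) as elapsed_min
from   v$rman_backup_job_details
order  by start_time desc
fetch first 5 rows only;
EOF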
Solution Implemented
Rescheduled the backup for the midnight time window.
Following this, the application did not encounter any further timeouts.