Problem Statement
The production database was hosted on a 6-node RAC setup and used for a PeopleSoft application.
Data Guard was configured with a LOCAL FAR SYNC (same DC) and a REMOTE FAR SYNC (remote DC at 30 km).
With LOCAL FAR SYNC enabled, the application runs smoothly.
With REMOTE FAR SYNC enabled, one of the PeopleSoft modules, "Integration Broker" (IB), was affected.
IB message flow was noticeably slower than normal.
Most of the messages were stuck in "Started" status.
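Before digging into the database, a quick status roll-up helps quantify the backlog. The sketch below is our own; it uses the PSAPMSGPUBHDR table and PUBSTATUS column that appear later in this post, and deliberately does not hard-code the PeopleSoft numeric code for the "Started" status.

-- Hypothetical check: message counts per publication status for today's messages.
-- PSAPMSGPUBHDR / PUBSTATUS are the IB publication header table and its status column;
-- map the numeric codes to New/Started/Working/Done via the PeopleSoft documentation.
select pubstatus, count(*) as msg_count
from   psapmsgpubhdr
where  createdttm >= trunc(sysdate)
group  by pubstatus
order  by pubstatus;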
What is PeopleSoft Integration Broker?
Perform asynchronous and synchronous messaging among internal systems and third-party systems.
Expose PeopleSoft business logic as web services to PeopleSoft and third-party systems.
Consume and invoke web services from third-party and PeopleSoft systems.
In this setup, the asynchronous messaging architecture is used.
Integration Broker asynchronous messaging architecture
The publication broker, publication contractor, and subscription contractor services are the primary application server elements required for asynchronous messaging. The publication broker service routes the workload to both contractor server processes.
Publication broker
Acts as the routing mechanism. When an asynchronous service operation arrives in its queue, the publication broker service runs the defined routing rules. If the service operation needs to be published to a remote node, it routes the service operation to the publication contractor service. If the service operation is subscribed to on the local node, then the publication broker routes the service operation to the subscription contractor service. Routing involves submitting either a subscription or publication contract to the appropriate contractor, followed by an asynchronous call to the contractor service notifying it that work is waiting in the queue.
Server Processes – Broker dispatcher (PSBRKDSP), Broker handler (PSBRKHND)
Publication contractor
References the publication contract submitted by the publication broker service and performs an HTTP post of the publication service operation to the integration gateway. When the integration gateway sends a reply indicating that it received the publication service operation, the publication contractor service updates the publication contract with the status of the publication processing (Done or Retry).
Server Processes – Publication dispatcher (PSPUBDSP), Publication handler (PSPUBHND)
Subscription contractor
References the subscription contract submitted by the publication broker service and runs the appropriate notification PeopleCode. It then updates the subscription contract with the status of the subscription processing.
Server Processes – Subscription dispatcher (PSSUBDSP), Subscription handler (PSSUBHND)
Note: At the database server, we can see the individual sessions initiated by the IB module, identified by the IB server process names.
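One way to see these sessions at the database layer is to filter GV$SESSION on the dispatcher/handler names. This is a sketch under the assumption that the process names (PSBRKDSP, PSBRKHND, PSPUBDSP, PSPUBHND, PSSUBDSP, PSSUBHND) show up in the PROGRAM column; adjust the filter to MODULE if your environment reports them there.

-- Sketch: list IB server process sessions across all RAC instances.
select inst_id, sid, serial#, program, module, status
from   gv$session
where  program like 'PSBRK%'
   or  program like 'PSPUB%'
   or  program like 'PSSUB%'
order  by inst_id, program;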
Diagnosis
Data Guard was configured with 12c FAST SYNC (SYNC NOAFFIRM) using the log_archive_dest_n parameter below.
service=REMOTESER, SYNC NOAFFIRM delay=0 optional compression=enable max_failure=1 max_connections=1 reopen=5 db_unique_name=PRODREMOTE net_timeout=10, alternate=LOG_ARCHIVE_DEST_2 valid_for=(online_logfile, all_roles)
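For reference, such a destination would typically be set with ALTER SYSTEM. The snippet below is illustrative only; the destination number (LOG_ARCHIVE_DEST_3) is our assumption, since the post does not state which dest_n was used, while REMOTESER and PRODREMOTE come from the parameter string above.

-- Illustrative only: setting the remote destination with the attributes shown above.
alter system set log_archive_dest_3=
  'service=REMOTESER SYNC NOAFFIRM delay=0 optional compression=enable max_failure=1 max_connections=1 reopen=5 db_unique_name=PRODREMOTE net_timeout=10 alternate=LOG_ARCHIVE_DEST_2 valid_for=(online_logfile,all_roles)'
  scope=both sid='*';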
We observed an increase in "log file sync" waits with the REMOTE FAR SYNC setup.
The average log file sync wait time was 30-40 milliseconds with REMOTE FAR SYNC, compared to 1-2 milliseconds with LOCAL FAR SYNC.
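These averages can be confirmed directly from the wait interface. The query below is a sketch; it reads cumulative statistics from GV$SYSTEM_EVENT (AWR/DBA_HIST views would give per-interval numbers if licensed).

-- Sketch: average "log file sync" latency per instance, cumulative since startup.
select inst_id,
       event,
       total_waits,
       round(time_waited_micro / nullif(total_waits, 0) / 1000, 2) as avg_ms
from   gv$system_event
where  event = 'log file sync'
order  by inst_id;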
LGWR trace files were then reviewed to identify the time required for the WRITE operation.
LGWR Trace Excerpts
10034ms, size 54486KB
9687ms, size 68603KB
9491ms, size 34903KB
9464ms, size 62778KB
9327ms, size 53726KB
9272ms, size 7425KB
9256ms, size 51352KB
9176ms, size 30762KB
TIMEOUT messages from LGWR trace file
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 19:52:07> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 19:52:37> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 19:55:12> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 19:58:29> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 20:03:25> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 20:04:39> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 20:06:23> and NET_TIMEOUT <10> seconds
To investigate the issue further, the lfsdiag output was extracted and examined to identify what LGWR was doing during the issue period.
Excerpts from lfsdiag output
MINUTE      INST PROGRAM                       EVENT                      TOTAL_WAIT_TIME  WAITS  AVG_TIME_WAITED
----------- ---- ----------------------------- -------------------------  ---------------  -----  ---------------
May03_1140  5    [email protected] (LGWR)      Redo Transport Open               3493.851      4          873.463
May03_1140  6    [email protected] (LGWR)      target log write size                 .059      1             .059
May03_1141  1    [email protected] (LGWR)      Redo Transport Open               3628.460      3         1209.487
May03_1141  1    [email protected] (LGWR)      SYNC Remote Write                     .053      1             .053
May03_1141  2    [email protected] (LGWR)      Redo Transport Open               3474.625      3         1158.208
May03_1141  2    [email protected] (LGWR)      SYNC Remote Write                     .531      2             .266
May03_1141  3    [email protected] (LGWR)      Redo Transport Open               3500.844      3         1166.948
May03_1141  3    [email protected] (LGWR)      log file parallel write              2.981      3             .994
May03_1141  3    [email protected] (LGWR)      SYNC Remote Write                     .310      1             .310
May03_1141  4    [email protected] (LGWR)      Redo Transport Open               3480.429      4          870.107
May03_1141  4    [email protected] (LGWR)      SYNC Remote Write                    1.499      3             .500
May03_1141  5    [email protected] (LGWR)      SYNC Remote Write                     .169      1             .169
May03_1141  6    [email protected] (LGWR)      Redo Transport Open               3494.868      4          873.717
May03_1141  6    [email protected] (LGWR)      log file parallel write              1.469      4             .367
May03_1141  6    [email protected] (LGWR)      SYNC Remote Write                     .577      2             .289
May03_1142  1    [email protected] (LGWR)      Redo Transport Open              11542.659     11         1049.333
May03_1142  1    [email protected] (LGWR)      SYNC Remote Write                 4435.589     19          233.452
May03_1142  3    [email protected] (LGWR)      Redo Transport Open              11675.964     11         1061.451
May03_1142  4    [email protected] (LGWR)      Redo Transport Open                610.381      1          610.381
"Redo Transport Open" was consistently high during the issue period.
As per the Oracle documentation, this event tracks the time spent (in centiseconds) by the ARCn, NSSn, and TTnn background processes on RFSCREAT and RFSANNCE operations.
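The same event can also be watched live from the session wait statistics of the redo transport processes. The query below is a sketch that reads cumulative per-session numbers from GV$SESSION_EVENT.

-- Sketch: time spent by background processes on the redo transport events of interest.
select s.inst_id, s.sid, s.program, e.event,
       e.total_waits,
       round(e.time_waited_micro / 1000) as time_waited_ms
from   gv$session_event e
       join gv$session s
         on s.inst_id = e.inst_id
        and s.sid     = e.sid
where  e.event in ('Redo Transport Open', 'SYNC Remote Write')
order  by e.time_waited_micro desc;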
However, the same event was not seen with LOCAL FAR SYNC enabled.
This directed us to probe the network latency.
The network team confirmed that a firewall was configured between the two sites and disabled it to measure the impact.
After the firewall settings were disabled, the average "log file sync" wait time dropped to 4-5 milliseconds.
We believed the issue was resolved.
However, even after this improvement, the application team confirmed that they were still facing the same issue with IB messaging.
For further debugging, we enabled NSS/RFS tracing to identify whether it was a remote I/O issue or a network latency problem.
How to Trace NSS/RFS?
Production DB
alter system set events '16410 trace name context forever, level 16';
Remote FAR SYNC DB
alter system set events '16410 trace name context forever, level 16';
alter system set events '10046 trace name context forever, level 12';
To Turn Off
alter system set events '16410 trace name context off';
alter system set events '10046 trace name context off';
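The resulting NSS and RFS trace files are written to the standard diagnostic trace directory of each instance. The small helper below locates that directory on each node.

-- Locate the trace directory (and the current session's trace file) on each node.
select inst_id, name, value
from   gv$diag_info
where  name in ('Diag Trace', 'Default Trace File');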
Excerpts from NSS/RFS trace file
Total time 221 milliseconds (585-364) for complete cycle

NSS Trace File
Client sending SQLNET request data [kpurcs] oper='Write' flag=67108994 thrd=1 seq=1140782 msgid=1104004
*** 2018-04-23 06:37:34.364142 8464 krsu.c
*** 2018-04-23 06:37:34.585089 8623 krsu.c
...
Client received SQLNET call [kpurcs] response oper='Write' flag=67108994 thrd=1 seq=1140782 msgid=1104004
WAIT #0: nam='Redo Transport MISC' ela= 221011 p1=0 p2=0 p3=0 obj#=-1 tim=2579405557992
WAIT #0: nam='rdbms ipc message' ela= 5 timeout=18 p2=0 p3=0 obj#=-1 tim=2579405558069
WAIT #0: nam='rdbms ipc message' ela= 180 timeout=18 p2=0 p3=0 obj#=-1 tim=2579405558267
WAIT #0: nam='LNS wait on LGWR' ela= 9 p1=0 p2=0 p3=0 obj#=-1 tim=2579405558298

Write time at fallback DB - 1 millisecond

RFS trace file
...
Server received SQLNET data [krsr_rfs_dsp] oper='Write' flag=67108994 thrd=1 seq=1140782 msgid=1104004
*** 2018-04-23 06:37:34.584297 1252 krsr.c
WAIT #0: nam='RFS write' ela= 123 p1=0 p2=0 p3=0 obj#=-1 tim=4348055466688
...
Server finished processing SQLNET data [krsr_rfs_dsp] oper='Write' flag=67108994 thrd=1 seq=1140782 msgid=1104004
*** 2018-04-23 06:37:34.585471 1999 krsr.c
WAIT #0: nam='RFS dispatch' ela= 92 p1=0 p2=0 p3=0 obj#=-1 tim=4348055466741
WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=4348055466756
Out of the 221 milliseconds, only 1 millisecond was spent on the write at the fallback DB; the remaining 220 milliseconds were network latency.
From the above details, we were confident that the network latency was not consistent and at times exceeded 200 milliseconds.
The customer had already invested heavily in a dedicated MPLS network for this configuration and was not in a position to move to a dark fibre network with 1-millisecond latency right away; reconfiguring the network entirely could take 4-6 months.
Given this situation, we had to find a workaround so that the PeopleSoft IB module could survive with the current network latency.
The next step was to understand the application behavior with the REMOTE FAR SYNC setup.
Over the next two days we thoroughly monitored the application performance with the PeopleSoft team and tried to characterize the behavior.
Queries to verify IB status
select b.queuename, b.createdttm, b.lastupddttm,
       extract(second from (b.lastupddttm - b.createdttm))
from   psapmsgpubhdr b
where  b.queuename in ('QNAME1','QNAME2')
and    trunc(createdttm) = trunc(sysdate-1)
and    b.pubstatus = 4
order  by 1,3;

select b.queuename, b.createdttm, b.lastupddttm,
       extract(second from (b.lastupddttm - b.createdttm))
from   psapmsgpubcon b
where  b.queuename in ('QNAME1','QNAME2')
and    trunc(createdttm) = trunc(sysdate-1)
and    b.pubconstatus = 4
order  by 1,3;

select b.queuename, b.createdttm, b.lastupddttm,
       extract(second from (b.lastupddttm - b.createdttm))
from   psapmsgsubcon b
where  b.queuename in ('QNAME1','QNAME2')
and    trunc(createdttm) = trunc(sysdate-1)
and    b.subconstatus = 4
order  by 1,3;

select b.queuename,
       a.createdttm as HCrt, a.lastupddttm as HLUpdD,
       extract(minute from (a.lastupddttm - a.createdttm)) as HDEpTime,
       b.createdttm as PCrt, b.lastupddttm as PLUpD,
       extract(second from (b.lastupddttm - b.createdttm)) as PCEpTime
from   psapmsgpubcon b, psapmsgpubhdr a
where  b.queuename = 'QNAME'
and    a.ibtransactionid = b.ibpubtransactid
and    trunc(b.lastupddttm) = trunc(sysdate-1)
and    b.pubconstatus = 4;

select b.queuename, b.createdttm, b.lastupddttm, b.ibtransactionid,
       extract(hour   from (b.lastupddttm - b.createdttm)) as HR,
       extract(minute from (b.lastupddttm - b.createdttm)) as MN,
       extract(second from (b.lastupddttm - b.createdttm)) as SE,
       a.createdttm, a.lastupddttm, a.ibpubtransactid,
       extract(hour   from (a.lastupddttm - a.createdttm)) as HR,
       extract(minute from (a.lastupddttm - a.createdttm)) as MN,
       extract(second from (a.lastupddttm - a.createdttm)) as SE
from   psapmsgpubhdr b, psapmsgpubcon a
where  b.queuename = 'QNAME'
and    b.ibtransactionid = a.ibpubtransactid
and    trunc(b.lastupddttm) = trunc(sysdate-1)
and    b.pubstatus = 4
order  by 2,3,8,9;
We observed that the ORDERED queues were affected, while the UNORDERED queues were not.
What is an ORDERED/UNORDERED QUEUE?
This behavior is controlled by the Unordered check box on the queue definition.
By default, the check box is cleared and inbound service operations that are assigned to a queue are processed one at a time, sequentially, in the order that they are sent.
Select the check box to force the channel to handle all of its service operations in parallel (unordered), which does not guarantee a particular processing sequence. This also disables the channel's partitioning fields.
Clear the check box if you want all of the queue's service operations processed sequentially or if you want to use the partitioning fields.
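To compare how the ordered and unordered queues behaved, a simple roll-up over PSAPMSGPUBHDR (the same table, columns, and status value 4 used in the verification queries above) is enough. The sketch below is ours; the DATE cast is just a convenient way to express the elapsed time in seconds.

-- Sketch: per-queue average time from creation to last update for completed messages.
select queuename,
       count(*) as done_msgs,
       round(avg((cast(lastupddttm as date) - cast(createdttm as date)) * 86400), 1) as avg_seconds
from   psapmsgpubhdr
where  trunc(createdttm) = trunc(sysdate - 1)
and    pubstatus = 4
group  by queuename
order  by avg_seconds desc;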
We identified the three most important tables of the IB module, through which the entire message flow passes.
PSAPMSGPUBHDR, PSAPMSGSUBCON, PSAPMSGPUBCON
How Does an IB Message Flow?
For every new message, an INSERT occurs in the corresponding tables.
At every stage (New/Started/Working/Done/Timeout), an UPDATE runs for the same IB transaction ID.
This entire operation is performed by specific HANDLERS (IB server processes).
Each HANDLER performs a single operation at a time and is then released for the next request.
We observed that the elapsed time increased for all of the corresponding UPDATE queries.
TOP SQLs
SQL_ID         SQL_TEXT
-------------  ------------------------------------------------------------
gf3jhsbkks0up  UPDATE PSAPMSGPUBHDR SET PUBSTATUS=:1, STATUSSTRING=:2,
               RETRYCOUNT=RETRYCOUNT+1, LASTUPDDTTM=CAST(SYSTIMESTAMP AS TIMESTAMP),
               MACHINENAME=:3, PROCESSID=:4, SLAVEPROCESSNAME=:5, IB_SLAVEQUEUED=:6
               WHERE IBTRANSACTIONID=:7
               AND LASTUPDDTTM=TO_TIMESTAMP(:8,'YYYY-MM-DD-HH24.MI.SS.FF')
               AND PUBSTATUS=:9

g55mxy2xmq9b5  UPDATE PSAPMSGPUBHDR SET PUBSTATUS=:1, STATUSSTRING=:2,
               MACHINENAME=:3, PROCESSID=:4, LASTUPDDTTM=CAST(SYSTIMESTAMP AS TIMESTAMP),
               IB_SLAVEQUEUED=0
               WHERE IBTRANSACTIONID=:5 AND PUBSTATUS=:6

048znjmq3uvs9  SELECT DISTINCT A.QUEUENAME FROM PSAPMSGPUBHDR A, PSQUEUEDEFN B
               WHERE PUBSTATUS IN (1, 2) AND A.QUEUENAME=B.QUEUENAME
               AND B.QUEUESTATUS<>2

11ukgdk961fn9  SELECT A.QUEUENAME FROM PSAPMSGPUBCON A, PSQUEUEDEFN B
               WHERE PUBCONSTATUS IN (1, 2, 5) AND A.QUEUENAME=B.QUEUENAME
               AND B.QUEUESTATUS<>2 ORDER BY A.LASTUPDDTTM

08jrwnqh17pb8  SELECT IBTRANSACTIONID, EXTOPERATIONNAME, IB_OPERATIONNAME,
               EXTERNALMESSAGEID, PUBNODE, QUEUENAME, QUEUESEQID, SUBQUEUE,
               ORIGPUBNODE, PUBCLASS, PUBLISHER, PUBPROC, TRXTYPE,
               TO_CHAR(CAST((CREATEDTTM) AS TIMESTAMP),'YYYY-MM-DD-HH24.MI.SS.FF'),
               TO_CHAR(CAST((PUBLISHTIMESTAMP) AS TIMESTAMP),'YYYY-MM-DD-HH24.MI.SS.FF'),
               NRID, DESTPUBNODE, CONVERSATIONID, INREPLYTOID, ISREQUEST,
               CANONICALTRSFRMID, PUBSTATUS,
               TO_CHAR(CAST((LASTUPDDTTM) AS TIMESTAMP),'YYYY-MM-DD-HH24.MI.SS.FF'),
               RETRYCOUNT, MACHINENAME, SLAVEPROCESSNAME, IB_SLAVEQUEUED, PROCESSID
               FROM PSAPMSGPUBHDR
               WHERE QUEUENAME=:1 AND PUBSTATUS<>9
               AND LASTUPDDTTM >= TO_TIMESTAMP(:2,'YYYY-MM-DD-HH24.MI.SS.FF')
               AND IB_SLAVEQUEUED IN (0)
               ORDER BY SUBQUEUE, PUBLISHTIMESTAMP, QUEUESEQID
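The per-execution cost of these statements can be tracked straight from the shared pool. The query below is a sketch; GV$SQL reports ELAPSED_TIME and CLUSTER_WAIT_TIME in microseconds, and the two UPDATE SQL_IDs are taken from the list above.

-- Sketch: average elapsed and cluster wait time per execution for the top UPDATEs.
select sql_id,
       executions,
       round(elapsed_time / nullif(executions, 0) / 1000, 2)      as avg_elapsed_ms,
       round(cluster_wait_time / nullif(executions, 0) / 1000, 2) as avg_cluster_ms
from   gv$sql
where  sql_id in ('gf3jhsbkks0up', 'g55mxy2xmq9b5')
order  by avg_elapsed_ms desc;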
So, What Led to This?
We observed very high cluster waits with REMOTE FAR SYNC enabled compared with LOCAL FAR SYNC.
A similar impact was seen at the SQL level as well.
By design, the application contains multiple handlers to process messages.
At the database layer, each handler has one database connection.
These connections are spread across multiple instances when RAC is enabled.
During IB message processing, a single insert occurs on the message header table for each message, followed by updates at each stage for the same transaction ID.
With LOCAL FAR SYNC enabled, the query elapsed time stays within 3-10 milliseconds per execution, which prevents concurrent requests from multiple instances and avoids any GC impact.
With REMOTE FAR SYNC enabled, the elapsed time of the first execution increases to 10-15 milliseconds because of the higher latency.
At the same time, requests arrive from the other instances to update the same records, resulting in simultaneous requests and increased GC waits.
Execution drift with REMOTE FAR SYNC
- The 1st execution on Instance 1 takes 10 ms.
- Because of the longer execution time caused by the high latency, the 2nd instance requests the same block to update the same IB transaction and waits an additional 5-10 ms on GC events.
- Once the block is received, the query completes with a total elapsed time of 20 ms (GC time 10 ms + execution time 10 ms).
- Now, for the same IB transaction, another instance requests the same block and waits 20 ms on GC waits.
- Once received, the total elapsed time rises to 30 ms (GC time 20 ms + execution time 10 ms).
- The same behavior repeats on all remaining instances.
- As a cascading impact, query elapsed times keep increasing, causing overall slowness dominated by GC wait events.
Such a situation does not arise with the LOCAL FAR SYNC setup, because query elapsed times stay low enough that concurrent requests for the same blocks rarely overlap.
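The global cache component of this cascade can be confirmed from active session history. The query below is a sketch (it requires the Diagnostics Pack license for ASH) and reuses the two UPDATE SQL_IDs listed earlier.

-- Sketch: count of ASH samples on gc* waits for the affected UPDATE statements.
select sql_id, event, count(*) as samples
from   gv$active_session_history
where  sql_id in ('gf3jhsbkks0up', 'g55mxy2xmq9b5')
and    event like 'gc%'
group  by sql_id, event
order  by samples desc;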
Solution Implemented
A RAC service was created with PREFERRED and AVAILABLE instances to eliminate the cluster GC impact.
srvctl add service -db DBNAME -service SERVICENAME -preferred HOST01 -available HOST02
The IB module was pointed to the newly created service.
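Once the service is started and the IB domains reconnect, all IB sessions should land on the single preferred instance, which removes the cross-instance block contention. The quick verification below is a sketch; SERVICENAME is the same placeholder used in the srvctl command above.

-- Sketch: confirm that IB connections are concentrated on one instance.
select inst_id, service_name, count(*) as sessions
from   gv$session
where  service_name = 'SERVICENAME'
group  by inst_id, service_name;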
After this change, the IB performance issue was resolved; the message flow no longer got stuck in the "Started" state and completed with the expected concurrency.