Problem Statement
The production database was hosted on a 6-node RAC setup and used for a PeopleSoft application.
Data Guard was configured with a LOCAL FAR SYNC (same DC) and a REMOTE FAR SYNC (remote DC at 30 km).
With LOCAL FAR SYNC enabled, the application runs smoothly.
With REMOTE FAR SYNC enabled, one of the PeopleSoft modules, "Integration Broker" (IB), was affected.
IB message flow was noticeably slower than normal.
Most of the messages were stuck in "Started" status.
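Before digging into the database, a quick status roll-up helps quantify the backlog. The sketch below is our own; it uses the PSAPMSGPUBHDR table and PUBSTATUS column that appear later in this post, and deliberately does not hard-code the PeopleSoft numeric code for the "Started" status.

-- Hypothetical check: message counts per publication status for today's messages.
-- PSAPMSGPUBHDR / PUBSTATUS are the IB publication header table and its status column;
-- map the numeric codes to New/Started/Working/Done via the PeopleSoft documentation.
select pubstatus, count(*) as msg_count
from   psapmsgpubhdr
where  createdttm >= trunc(sysdate)
group  by pubstatus
order  by pubstatus;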
What is PeopleSoft Integration Broker?
Perform asynchronous and synchronous messaging among internal systems and third-party systems.
Expose PeopleSoft business logic as web services to PeopleSoft and third-party systems.
Consume and invoke web services from third-party and PeopleSoft systems.
In this setup, the asynchronous messaging architecture is used.
Integration Broker asynchronous messaging architecture
The publication broker, publication contractor, and subscription contractor services are the primary application server elements required for asynchronous messaging. The publication broker service routes the workload to both contractor server processes.
Publication broker
Acts as the routing mechanism. When an asynchronous service operation arrives in its queue, the publication broker service runs the defined routing rules. If the service operation needs to be published to a remote node, it routes the service operation to the publication contractor service. If the service operation is subscribed to on the local node, then the publication broker routes the service operation to the subscription contractor service. Routing involves submitting either a subscription or publication contract to the appropriate contractor, followed by an asynchronous call to the contractor service notifying it that work is waiting in the queue.
Server Processes – Broker dispatcher (PSBRKDSP), Broker handler (PSBRKHND)
Publication contractor
References the publication contract submitted by the publication broker service and performs an HTTP post of the publication service operation to the integration gateway. When the integration gateway sends a reply indicating that it received the publication service operation, the publication contractor service updates the publication contract with the status of the publication processing (Done or Retry).
Server Processes – Publication dispatcher (PSPUBDSP), Publication handler (PSPUBHND)
Subscription contractor
References the subscription contract submitted by the publication broker service and runs the appropriate notification PeopleCode. It then updates the subscription contract with the status of the subscription processing.
Server Processes – Subscription dispatcher (PSSUBDSP), Subscription handler (PSSUBHND)
Note: At the database server, we can see the individual sessions initiated by the IB module, identified by the IB server process names.
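One way to see these sessions at the database layer is to filter GV$SESSION on the dispatcher/handler names. This is a sketch under the assumption that the process names (PSBRKDSP, PSBRKHND, PSPUBDSP, PSPUBHND, PSSUBDSP, PSSUBHND) show up in the PROGRAM column; adjust the filter to MODULE if your environment reports them there.

-- Sketch: list IB server process sessions across all RAC instances.
select inst_id, sid, serial#, program, module, status
from   gv$session
where  program like 'PSBRK%'
   or  program like 'PSPUB%'
   or  program like 'PSSUB%'
order  by inst_id, program;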
Diagnosis
Data Guard was configured with 12c FAST SYNC (SYNC NOAFFIRM) using the log_archive_dest_n parameter below.
service=REMOTESER, SYNC NOAFFIRM delay=0 optional compression=enable max_failure=1 max_connections=1 reopen=5 db_unique_name=PRODREMOTE net_timeout=10, alternate=LOG_ARCHIVE_DEST_2 valid_for=(online_logfile, all_roles)
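For reference, such a destination would typically be set with ALTER SYSTEM. The snippet below is illustrative only; the destination number (LOG_ARCHIVE_DEST_3) is our assumption, since the post does not state which dest_n was used, while REMOTESER and PRODREMOTE come from the parameter string above.

-- Illustrative only: setting the remote destination with the attributes shown above.
alter system set log_archive_dest_3=
  'service=REMOTESER SYNC NOAFFIRM delay=0 optional compression=enable max_failure=1 max_connections=1 reopen=5 db_unique_name=PRODREMOTE net_timeout=10 alternate=LOG_ARCHIVE_DEST_2 valid_for=(online_logfile,all_roles)'
  scope=both sid='*';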
We observed an increase in "log file sync" waits with the REMOTE FAR SYNC setup.
The average log file sync wait time was 30-40 milliseconds with REMOTE FAR SYNC, compared to 1-2 milliseconds with LOCAL FAR SYNC.
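These averages can be confirmed directly from the wait interface. The query below is a sketch; it reads cumulative statistics from GV$SYSTEM_EVENT (AWR/DBA_HIST views would give per-interval numbers if licensed).

-- Sketch: average "log file sync" latency per instance, cumulative since startup.
select inst_id,
       event,
       total_waits,
       round(time_waited_micro / nullif(total_waits, 0) / 1000, 2) as avg_ms
from   gv$system_event
where  event = 'log file sync'
order  by inst_id;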
LGWR trace files were then reviewed to identify the time required for the WRITE operation.
LGWR Trace Excerpts
10034ms, size 54486KB
9687ms, size 68603KB
9491ms, size 34903KB
9464ms, size 62778KB
9327ms, size 53726KB
9272ms, size 7425KB
9256ms, size 51352KB
9176ms, size 30762KB
TIMEOUT messages from LGWR trace file
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 19:52:07> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 19:52:37> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 19:55:12> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 19:58:29> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 20:03:25> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 20:04:39> and NET_TIMEOUT <10> seconds
Making upidhs request to NSS4 (ocis 0x7fe123e688f8). Begin time is <04/23/2018 20:06:23> and NET_TIMEOUT <10> seconds
To investigate the issue further, the lfsdiag output was extracted and examined to identify what LGWR was doing during the issue period.
Excerpts from lfsdiag output
MINUTE      INST PROGRAM                       EVENT                      TOTAL_WAIT_TIME  WAITS  AVG_TIME_WAITED
----------- ---- ----------------------------- -------------------------  ---------------  -----  ---------------
May03_1140  5    [email protected] (LGWR)      Redo Transport Open               3493.851      4          873.463
May03_1140  6    [email protected] (LGWR)      target log write size                 .059      1             .059
May03_1141  1    [email protected] (LGWR)      Redo Transport Open               3628.460      3         1209.487
May03_1141  1    [email protected] (LGWR)      SYNC Remote Write                     .053      1             .053
May03_1141  2    [email protected] (LGWR)      Redo Transport Open               3474.625      3         1158.208
May03_1141  2    [email protected] (LGWR)      SYNC Remote Write                     .531      2             .266
May03_1141  3    [email protected] (LGWR)      Redo Transport Open               3500.844      3         1166.948
May03_1141  3    [email protected] (LGWR)      log file parallel write              2.981      3             .994
May03_1141  3    [email protected] (LGWR)      SYNC Remote Write                     .310      1             .310
May03_1141  4    [email protected] (LGWR)      Redo Transport Open               3480.429      4          870.107
May03_1141  4    [email protected] (LGWR)      SYNC Remote Write                    1.499      3             .500
May03_1141  5    [email protected] (LGWR)      SYNC Remote Write                     .169      1             .169
May03_1141  6    [email protected] (LGWR)      Redo Transport Open               3494.868      4          873.717
May03_1141  6    [email protected] (LGWR)      log file parallel write              1.469      4             .367
May03_1141  6    [email protected] (LGWR)      SYNC Remote Write                     .577      2             .289
May03_1142  1    [email protected] (LGWR)      Redo Transport Open              11542.659     11         1049.333
May03_1142  1    [email protected] (LGWR)      SYNC Remote Write                 4435.589     19          233.452
May03_1142  3    [email protected] (LGWR)      Redo Transport Open              11675.964     11         1061.451
May03_1142  4    [email protected] (LGWR)      Redo Transport Open                610.381      1          610.381
"Redo Transport Open" was consistently high during the issue period.
As per the Oracle documentation, this event tracks the time spent (in centiseconds) by the ARCn, NSSn, and TTnn background processes on RFSCREAT and RFSANNCE operations.
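The same event can also be watched live from the session wait statistics of the redo transport processes. The query below is a sketch that reads cumulative per-session numbers from GV$SESSION_EVENT.

-- Sketch: time spent by background processes on the redo transport events of interest.
select s.inst_id, s.sid, s.program, e.event,
       e.total_waits,
       round(e.time_waited_micro / 1000) as time_waited_ms
from   gv$session_event e
       join gv$session s
         on s.inst_id = e.inst_id
        and s.sid     = e.sid
where  e.event in ('Redo Transport Open', 'SYNC Remote Write')
order  by e.time_waited_micro desc;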
However, the same event was not seen with LOCAL FAR SYNC enabled.
This directed us to probe the network latency.
The network team confirmed that a firewall was configured between the two sites and disabled it to measure the impact.
After the firewall settings were disabled, the average "log file sync" wait time dropped to 4-5 milliseconds.
We believed the issue was resolved.
However, even after this improvement, the application team confirmed that they were still facing the same issue with IB messaging.
For further debugging, we enabled NSS/RFS tracing to identify whether it was a remote I/O issue or a network latency problem.
How to Trace NSS/RFS?
Production DB
alter system set events '16410 trace name context forever, level 16';
Remote FAR SYNC DB
alter system set events '16410 trace name context forever, level 16';
alter system set events '10046 trace name context forever, level 12';
To Turn Off
alter system set events '16410 trace name context off';
alter system set events '10046 trace name context off';
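The resulting NSS and RFS trace files are written to the standard diagnostic trace directory of each instance. The small helper below locates that directory on each node.

-- Locate the trace directory (and the current session's trace file) on each node.
select inst_id, name, value
from   gv$diag_info
where  name in ('Diag Trace', 'Default Trace File');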
Excerpts from NSS/RFS trace file
Total time 221 milliseconds (585-364) for complete cycle

NSS Trace File
Client sending SQLNET request data [kpurcs] oper='Write' flag=67108994 thrd=1 seq=1140782 msgid=1104004
*** 2018-04-23 06:37:34.364142 8464 krsu.c
*** 2018-04-23 06:37:34.585089 8623 krsu.c
...
Client received SQLNET call [kpurcs] response oper='Write' flag=67108994 thrd=1 seq=1140782 msgid=1104004
WAIT #0: nam='Redo Transport MISC' ela= 221011 p1=0 p2=0 p3=0 obj#=-1 tim=2579405557992
WAIT #0: nam='rdbms ipc message' ela= 5 timeout=18 p2=0 p3=0 obj#=-1 tim=2579405558069
WAIT #0: nam='rdbms ipc message' ela= 180 timeout=18 p2=0 p3=0 obj#=-1 tim=2579405558267
WAIT #0: nam='LNS wait on LGWR' ela= 9 p1=0 p2=0 p3=0 obj#=-1 tim=2579405558298

Write time at fallback DB - 1 millisecond

RFS trace file
...
Server received SQLNET data [krsr_rfs_dsp] oper='Write' flag=67108994 thrd=1 seq=1140782 msgid=1104004
*** 2018-04-23 06:37:34.584297 1252 krsr.c
WAIT #0: nam='RFS write' ela= 123 p1=0 p2=0 p3=0 obj#=-1 tim=4348055466688
...
Server finished processing SQLNET data [krsr_rfs_dsp] oper='Write' flag=67108994 thrd=1 seq=1140782 msgid=1104004
*** 2018-04-23 06:37:34.585471 1999 krsr.c
WAIT #0: nam='RFS dispatch' ela= 92 p1=0 p2=0 p3=0 obj#=-1 tim=4348055466741
WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=4348055466756
Out of the 221 milliseconds, only 1 millisecond was spent on the write at the fallback DB; the remaining 220 milliseconds were network latency.
From the above details, we were confident that the network latency was not consistent and at times exceeded 200 milliseconds.
The customer had already invested heavily in a dedicated MPLS network for this configuration and was not in a position to move to a dark fibre network with 1-millisecond latency right away; reconfiguring the network entirely could take 4-6 months.
Given this situation, we had to find a workaround so that the PeopleSoft IB module could survive with the current network latency.
The next step was to understand the application behavior with the REMOTE FAR SYNC setup.
Over the next two days we thoroughly monitored the application performance with the PeopleSoft team and tried to characterize the behavior.
Queries to verify IB status
select b.queuename, b.createdttm, b.lastupddttm,
       extract(second from (b.lastupddttm - b.createdttm))
from   psapmsgpubhdr b
where  b.queuename in ('QNAME1','QNAME2')
and    trunc(createdttm) = trunc(sysdate-1)
and    b.pubstatus = 4
order  by 1,3;

select b.queuename, b.createdttm, b.lastupddttm,
       extract(second from (b.lastupddttm - b.createdttm))
from   psapmsgpubcon b
where  b.queuename in ('QNAME1','QNAME2')
and    trunc(createdttm) = trunc(sysdate-1)
and    b.pubconstatus = 4
order  by 1,3;

select b.queuename, b.createdttm, b.lastupddttm,
       extract(second from (b.lastupddttm - b.createdttm))
from   psapmsgsubcon b
where  b.queuename in ('QNAME1','QNAME2')
and    trunc(createdttm) = trunc(sysdate-1)
and    b.subconstatus = 4
order  by 1,3;

select b.queuename,
       a.createdttm as HCrt, a.lastupddttm as HLUpdD,
       extract(minute from (a.lastupddttm - a.createdttm)) as HDEpTime,
       b.createdttm as PCrt, b.lastupddttm as PLUpD,
       extract(second from (b.lastupddttm - b.createdttm)) as PCEpTime
from   psapmsgpubcon b, psapmsgpubhdr a
where  b.queuename = 'QNAME'
and    a.ibtransactionid = b.ibpubtransactid
and    trunc(b.lastupddttm) = trunc(sysdate-1)
and    b.pubconstatus = 4;

select b.queuename, b.createdttm, b.lastupddttm, b.ibtransactionid,
       extract(hour   from (b.lastupddttm - b.createdttm)) as HR,
       extract(minute from (b.lastupddttm - b.createdttm)) as MN,
       extract(second from (b.lastupddttm - b.createdttm)) as SE,
       a.createdttm, a.lastupddttm, a.ibpubtransactid,
       extract(hour   from (a.lastupddttm - a.createdttm)) as HR,
       extract(minute from (a.lastupddttm - a.createdttm)) as MN,
       extract(second from (a.lastupddttm - a.createdttm)) as SE
from   psapmsgpubhdr b, psapmsgpubcon a
where  b.queuename = 'QNAME'
and    b.ibtransactionid = a.ibpubtransactid
and    trunc(b.lastupddttm) = trunc(sysdate-1)
and    b.pubstatus = 4
order  by 2,3,8,9;
We observed that the ORDERED queues were affected, while the UNORDERED queues were not.
What is an ORDERED/UNORDERED QUEUE?
This behavior is controlled by the Unordered check box on the queue definition.
By default, the check box is cleared and inbound service operations that are assigned to a queue are processed one at a time, sequentially, in the order that they are sent.
Select the check box to force the channel to handle all of its service operations in parallel (unordered), which does not guarantee a particular processing sequence. This also disables the channel's partitioning fields.
Clear the check box if you want all of the queue's service operations processed sequentially or if you want to use the partitioning fields.
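To compare how the ordered and unordered queues behaved, a simple roll-up over PSAPMSGPUBHDR (the same table, columns, and status value 4 used in the verification queries above) is enough. The sketch below is ours; the DATE cast is just a convenient way to express the elapsed time in seconds.

-- Sketch: per-queue average time from creation to last update for completed messages.
select queuename,
       count(*) as done_msgs,
       round(avg((cast(lastupddttm as date) - cast(createdttm as date)) * 86400), 1) as avg_seconds
from   psapmsgpubhdr
where  trunc(createdttm) = trunc(sysdate - 1)
and    pubstatus = 4
group  by queuename
order  by avg_seconds desc;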
We identified the three most important tables of the IB module, through which the entire message flow passes.
PSAPMSGPUBHDR, PSAPMSGSUBCON, PSAPMSGPUBCON
How Does an IB Message Flow?
For every new message, an INSERT occurs in the corresponding tables.
At every stage (New/Started/Working/Done/Timeout), an UPDATE runs for the same IB transaction ID.
This entire operation is performed by specific HANDLERS (IB server processes).
Each HANDLER performs a single operation at a time and is then released for the next request.
We observed that the elapsed time increased for all of the corresponding UPDATE queries.
TOP SQLs
SQL_ID         SQL_TEXT
-------------  ------------------------------------------------------------
gf3jhsbkks0up  UPDATE PSAPMSGPUBHDR SET PUBSTATUS=:1, STATUSSTRING=:2,
               RETRYCOUNT=RETRYCOUNT+1, LASTUPDDTTM=CAST(SYSTIMESTAMP AS TIMESTAMP),
               MACHINENAME=:3, PROCESSID=:4, SLAVEPROCESSNAME=:5, IB_SLAVEQUEUED=:6
               WHERE IBTRANSACTIONID=:7
               AND LASTUPDDTTM=TO_TIMESTAMP(:8,'YYYY-MM-DD-HH24.MI.SS.FF')
               AND PUBSTATUS=:9

g55mxy2xmq9b5  UPDATE PSAPMSGPUBHDR SET PUBSTATUS=:1, STATUSSTRING=:2,
               MACHINENAME=:3, PROCESSID=:4, LASTUPDDTTM=CAST(SYSTIMESTAMP AS TIMESTAMP),
               IB_SLAVEQUEUED=0
               WHERE IBTRANSACTIONID=:5 AND PUBSTATUS=:6

048znjmq3uvs9  SELECT DISTINCT A.QUEUENAME FROM PSAPMSGPUBHDR A, PSQUEUEDEFN B
               WHERE PUBSTATUS IN (1, 2) AND A.QUEUENAME=B.QUEUENAME
               AND B.QUEUESTATUS<>2

11ukgdk961fn9  SELECT A.QUEUENAME FROM PSAPMSGPUBCON A, PSQUEUEDEFN B
               WHERE PUBCONSTATUS IN (1, 2, 5) AND A.QUEUENAME=B.QUEUENAME
               AND B.QUEUESTATUS<>2 ORDER BY A.LASTUPDDTTM

08jrwnqh17pb8  SELECT IBTRANSACTIONID, EXTOPERATIONNAME, IB_OPERATIONNAME,
               EXTERNALMESSAGEID, PUBNODE, QUEUENAME, QUEUESEQID, SUBQUEUE,
               ORIGPUBNODE, PUBCLASS, PUBLISHER, PUBPROC, TRXTYPE,
               TO_CHAR(CAST((CREATEDTTM) AS TIMESTAMP),'YYYY-MM-DD-HH24.MI.SS.FF'),
               TO_CHAR(CAST((PUBLISHTIMESTAMP) AS TIMESTAMP),'YYYY-MM-DD-HH24.MI.SS.FF'),
               NRID, DESTPUBNODE, CONVERSATIONID, INREPLYTOID, ISREQUEST,
               CANONICALTRSFRMID, PUBSTATUS,
               TO_CHAR(CAST((LASTUPDDTTM) AS TIMESTAMP),'YYYY-MM-DD-HH24.MI.SS.FF'),
               RETRYCOUNT, MACHINENAME, SLAVEPROCESSNAME, IB_SLAVEQUEUED, PROCESSID
               FROM PSAPMSGPUBHDR
               WHERE QUEUENAME=:1 AND PUBSTATUS<>9
               AND LASTUPDDTTM >= TO_TIMESTAMP(:2,'YYYY-MM-DD-HH24.MI.SS.FF')
               AND IB_SLAVEQUEUED IN (0)
               ORDER BY SUBQUEUE, PUBLISHTIMESTAMP, QUEUESEQID
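The per-execution cost of these statements can be tracked straight from the shared pool. The query below is a sketch; GV$SQL reports ELAPSED_TIME and CLUSTER_WAIT_TIME in microseconds, and the two UPDATE SQL_IDs are taken from the list above.

-- Sketch: average elapsed and cluster wait time per execution for the top UPDATEs.
select sql_id,
       executions,
       round(elapsed_time / nullif(executions, 0) / 1000, 2)      as avg_elapsed_ms,
       round(cluster_wait_time / nullif(executions, 0) / 1000, 2) as avg_cluster_ms
from   gv$sql
where  sql_id in ('gf3jhsbkks0up', 'g55mxy2xmq9b5')
order  by avg_elapsed_ms desc;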
So, What Led to This?
We observed very high cluster waits with REMOTE FAR SYNC enabled compared with LOCAL FAR SYNC.
A similar impact was seen at the SQL level as well.
By design, the application contains multiple handlers to process messages.
At the database layer, each handler has one database connection.
These connections are spread across multiple instances when RAC is enabled.
During IB message processing, a single insert occurs on the message header table for each message, followed by updates at each stage for the same transaction ID.
With LOCAL FAR SYNC enabled, the query elapsed time stays within 3-10 milliseconds per execution, which prevents concurrent requests from multiple instances and avoids any GC impact.
With REMOTE FAR SYNC enabled, the elapsed time of the first execution increases to 10-15 milliseconds because of the higher latency.
At the same time, requests arrive from the other instances to update the same records, resulting in simultaneous requests and increased GC waits.
Execution drift with REMOTE FAR SYNC
- The 1st execution on Instance 1 takes 10 ms.
- Because of the longer execution time caused by the high latency, the 2nd instance requests the same block to update the same IB transaction and waits an additional 5-10 ms on GC events.
- Once the block is received, the query completes with a total elapsed time of 20 ms (GC time 10 ms + execution time 10 ms).
- Now, for the same IB transaction, another instance requests the same block and waits 20 ms on GC waits.
- Once received, the total elapsed time rises to 30 ms (GC time 20 ms + execution time 10 ms).
- The same behavior repeats on all remaining instances.
- As a cascading impact, query elapsed times keep increasing, causing overall slowness dominated by GC wait events.
Such a situation does not arise with the LOCAL FAR SYNC setup, because query elapsed times stay low enough that concurrent requests for the same blocks rarely overlap.
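The global cache component of this cascade can be confirmed from active session history. The query below is a sketch (it requires the Diagnostics Pack license for ASH) and reuses the two UPDATE SQL_IDs listed earlier.

-- Sketch: count of ASH samples on gc* waits for the affected UPDATE statements.
select sql_id, event, count(*) as samples
from   gv$active_session_history
where  sql_id in ('gf3jhsbkks0up', 'g55mxy2xmq9b5')
and    event like 'gc%'
group  by sql_id, event
order  by samples desc;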
Solution Implemented
A RAC service was created with PREFERRED and AVAILABLE instances to eliminate the cluster GC impact.
srvctl add service -db DBNAME -service SERVICENAME -preferred HOST01 -available HOST02
The IB module was pointed to the newly created service.
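Once the service is started and the IB domains reconnect, all IB sessions should land on the single preferred instance, which removes the cross-instance block contention. The quick verification below is a sketch; SERVICENAME is the same placeholder used in the srvctl command above.

-- Sketch: confirm that IB connections are concentrated on one instance.
select inst_id, service_name, count(*) as sessions
from   gv$session
where  service_name = 'SERVICENAME'
group  by inst_id, service_name;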
After this change, the IB performance issue was resolved; the message flow no longer got stuck in the "Started" state and completed with the expected concurrency.