Table of Contents
Problem Statement
In Real application Cluster setup, SCAN LISTENER running on 1st node could only register local instances and not the remote instances, however the scan listener running on Node 2 could register instances from both nodes, as expected.
Diagnosis
One fine day, we observed load balancing issues inside the database which compelled us to check if there were connections from application which hit a specific node. Post ruling out any node bound connections, we surmised some hitch at the LISTENER level.
Local listener registration list seemed fine, however, SCAN LISTENER registration list was a source of amazement.
For Scan listeners running on Node1, only instances from Node1 were listed in the registration list, as against on Node2 where all the instances (from both nodes) were visible.
Node1
lsnrctl stat LISTENER_SCAN1 LSNRCTL for IBM/AIX RISC System/6000: Version 19.0.0.0.0 - Production on 15-NOV-2021 20:36:34 Copyright (c) 1991, 2021, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN1))) STATUS of the LISTENER ------------------------ Alias LISTENER_SCAN1 Version TNSLSNR for IBM/AIX RISC System/6000: Version 19.0.0.0.0 - Production Start Date 15-NOV-2021 20:29:58 Uptime 0 days 0 hr. 6 min. 35 sec Trace Level off Security ON: Local OS Authentication SNMP OFF Listener Parameter File /GRIDHOME/oracle/app/product/grid/19.3.0/network/admin/listener.ora Listener Log File /GRIDHOME/oracle/orabase/diag/tnslsnr/Node1/listener_scan1/alert/log.xml Listening Endpoints Summary... (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_SCAN1))) (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.xx.xx.xx)(PORT=1900))) Services Summary... Service "-MGMTDBXDB" has 1 instance(s). Instance "-MGMTDB", status READY, has 1 handler(s) for this service... Service "8939c3a49bbd01c2e0530af11a196e44" has 1 instance(s). Instance "-MGMTDB", status READY, has 1 handler(s) for this service... Service "DEMODB" has 1 instance(s). Instance "DEMODB1", status READY, has 40 handler(s) for this service... Service "_mgmtdb" has 1 instance(s). Instance "-MGMTDB", status READY, has 1 handler(s) for this service... Service "b58859d4571d06a0e0530a323030d700" has 1 instance(s). Instance "-MGMTDB", status READY, has 1 handler(s) for this service... Service "gimr_dscrep_10" has 1 instance(s). Instance "-MGMTDB", status READY, has 1 handler(s) for this service... The command completed successfully
Node2
lsnrctl stat LISTENER_SCAN2 LSNRCTL for IBM/AIX RISC System/6000: Version 19.0.0.0.0 - Production on 15-NOV-2021 20:37:16 Copyright (c) 1991, 2021, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN2))) STATUS of the LISTENER ------------------------ Alias LISTENER_SCAN2 Version TNSLSNR for IBM/AIX RISC System/6000: Version 19.0.0.0.0 - Production Start Date 15-NOV-2021 20:36:16 Uptime 0 days 0 hr. 1 min. 0 sec Trace Level off Security ON: Local OS Authentication SNMP OFF Listener Parameter File /GRIDHOME/oracle/app/product/grid/19.3.0/network/admin/listener.ora Listener Log File /GRIDHOME/oracle/orabase/diag/tnslsnr/Node2/listener_scan2/alert/log.xml Listening Endpoints Summary... (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_SCAN2))) (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.xx.xx.xx)(PORT=1900))) Services Summary... Service "-MGMTDBXDB" has 1 instance(s). Instance "-MGMTDB", status READY, has 1 handler(s) for this service... Service "8939c3a49bbd01c2e0530af11a196e44" has 1 instance(s). Instance "-MGMTDB", status READY, has 1 handler(s) for this service... Service "DEMODB" has 2 instance(s). Instance "DEMODB1", status READY, has 40 handler(s) for this service... Instance "DEMODB2", status READY, has 40 handler(s) for this service... Service "_mgmtdb" has 1 instance(s). Instance "-MGMTDB", status READY, has 1 handler(s) for this service... Service "b58859d4571d06a0e0530a323030d700" has 1 instance(s). Instance "-MGMTDB", status READY, has 1 handler(s) for this service... Service "gimr_dscrep_10" has 1 instance(s). Instance "-MGMTDB", status READY, has 1 handler(s) for this service... The command completed successfully
Preliminary exercises of restarting and relocating the SCAN listeners did not help.
Even Restarting CRS along with a Node reboot didn’t come to our rescue.
To circumvent this, we relocated all scan listeners to Node2 to avoid load balancing issues within the database.
To further resolve the issue, we configured separate TNS entry using all SCAN listeners and set remote listener parameter referring below support document.
SCAN Registration Issue: Services Fail to Register to SCAN Listeners (Doc ID 1448717.1)
REMOTE_LISTENERS_SCAN= (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = 10.xx.xx.x1)(PORT = 1900)) (ADDRESS = (PROTOCOL = TCP)(HOST = 10.xx.xx.x2)(PORT = 1900)) (ADDRESS = (PROTOCOL = TCP)(HOST = 10.xx.xx.x3)(PORT = 1900)) ) alter system set REMOTE_LISTENER='REMOTE_LISTENERS_SCAN' scope=both sid='*'; alter system register;
Issue still continues to haunt us.
Now…… What next????
In this setup, logging has been disabled proactively for all the listeners including SCAN.
Without Listener logging , it was onerous to understand the nature of errors and therefore isolate the problem.
As per regulatory compliance DR drill has been planned for this database.
This gave us privilege to enable listener logging at problematic cluster setup.
Enabling Scan Listener logging too didn’t help as there were no anomalies recorded.
Upon further studying and probing the issue , we came across the following document to perform additional tracing of the listener registration.
How to Trace Dynamic Registration from PMON or LREG (12c) (Doc ID 787055.1)
What is LREG????
In 12c and higher, there is a new background process called ora_lreg_DEMODB1
In previous releases, PMON always handled instance registration logic. In 12c, LREG (Listener REGistration) now takes over with the instance registration logic.
LREG registers instance information with the listener.
It is a critical background process for each database instance (if it is killed, Oracle goes down).
It does everything PMON used to do and report service_update, service_register, service_died in the listener.log.
How to Enable LREG trace
alter system set events = ‘immediate trace name LREG_STATE level 3’;
How to Disable?
alter system set events = ‘immediate trace name LREG_STATE off’;
We have seen some interesting errors in LREG trace file for SCAN listener running on Node1.
Remote listeners: 0 - (DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=))(ADDRESS=(PROTOCOL=tcp)(HOST=10.xx.xx.x1)(PORT=1900))) pdb 0 dci 0 flg 0x22 firewall=0, publish_service_acl=0 state=4, err=3 nse[0]=12535, nse[1]=12579, nte[0]=0, nte[1]=0, nte[2]=0 ncre=0 endp=(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.xx.xx.x1)(PORT=1900))) flg=0x80000000 nse=0 pri=0x122358628 1 - (DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=))(ADDRESS=(PROTOCOL=tcp)(HOST=10.xx.xx.x2)(PORT=1900))) pdb 0 dci 0 flg 0x22 firewall=0, publish_service_acl=0 state=1, err=0 nse[0]=0, nse[1]=0, nte[0]=0, nte[1]=0, nte[2]=0 ncre=0 endp=(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.xx.xx.x2)(PORT=1900))) flg=0x80000000 nse=0 pri=0x122358078 2 - (DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=))(ADDRESS=(PROTOCOL=tcp)(HOST=10.xx.xx.x3)(PORT=1900))) pdb 0 dci 0 flg 0x22 firewall=0, publish_service_acl=0 state=1, err=0 nse[0]=0, nse[1]=0, nte[0]=0, nte[1]=0, nte[2]=0 ncre=0 endp=(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.xx.xx.x3)(PORT=1900))) flg=0x80000000 nse=0 pri=0x122357b20
Error Details
oerr tns 12535 12535, 00000, "TNS:operation timed out" // *Cause: The requested operation could not be completed within the time out // period. // *Action: Look at the documentation on the secondary errors for possible // remedy. See SQLNET.LOG to find secondary error if not provided explicitly. // Turn on tracing to gather more information. oerr tns 12579 12579, 00000, "TNS:transport connection failed" // *Cause: An error was encountered while trying to create transport connection. // *Action: Likely reasons include host may be down, outbound transport // connection may be blocked by network or any other transport connectivity // issue. // If error persists, contact Oracle Customer Support.
So, what led to this????
The errors led us to yet another document.
ORA-12514 & Services Fail to Register to SCAN Listeners After HAIP Offline (Doc ID 2500067.1)
Oracle 12C introduces a new parameter: REMOTE_REGISTRATION_ADDRESS_<listener_name> which is configured internally in the SCAN listener to restrict registrations to the private network.
By default, it uses HAIP as remote registration address.
On problematic Node1, it was using HAIP address where as on Node2 it was private IP
Node1
lsnrctl show remote_registration_address listener_scan1 LSNRCTL for IBM/AIX RISC System/6000: Version 19.0.0.0.0 - Production on 15-NOV-2021 20:22:45 Copyright (c) 1991, 2021, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN1))) LISTENER parameter "remote_registration_address" set to (DESCRIPTION=(ASYNC_TIMER=yes)(EXPIRE_TIME=1)(TRANSPORT_CONNECT_TIMEOUT=15)(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=169.254.46.147)(PORT=49925)))) The command completed successfully
Node2
lsnrctl show remote_registration_address listener_scan2 LSNRCTL for IBM/AIX RISC System/6000: Version 19.0.0.0.0 - Production on 15-NOV-2021 20:23:55 Copyright (c) 1991, 2021, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN2))) LISTENER parameter "remote_registration_address" set to (DESCRIPTION=(ASYNC_TIMER=yes)(EXPIRE_TIME=1)(TRANSPORT_CONNECT_TIMEOUT=15)(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.30)(PORT=52350)))) The command completed successfully lsnrctl show remote_registration_address listener_scan3 LSNRCTL for IBM/AIX RISC System/6000: Version 19.0.0.0.0 - Production on 15-NOV-2021 20:24:03 Copyright (c) 1991, 2021, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN3))) LISTENER parameter "remote_registration_address" set to (DESCRIPTION=(ASYNC_TIMER=yes)(EXPIRE_TIME=1)(TRANSPORT_CONNECT_TIMEOUT=15)(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.30)(PORT=52384)))) The command completed successfully
When we look at HAIP status, it was in OFFLINE state on Node2 (Good Node but actual evil)
Node1
crsctl status res -t -init -------------------------------------------------------------------------------- Name Target State Server State details -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.cluster_interconnect.haip 1 ONLINE ONLINE Node1 STABLE --------------------------------------------------------------------------------
Node2
crsctl status res -t -init -------------------------------------------------------------------------------- Name Target State Server State details -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.cluster_interconnect.haip 1 ONLINE OFFLINE STABLE --------------------------------------------------------------------------------
Solution Implemented
Since the HAIP was not functioning on Node2, instance registration request from Node2 could not reach Node1. This is why Node2 instances were not listed in registration list on Node1.
Starting the HAIP on Node2 resolved the issue.
As a long-term and permanent resolution, we also disabled REMOTE_REGISTRATION_ADDRESS in listener.ora file for all SCAN listeners.
REMOTE_REGISTRATION_ADDRESS_LISTENER_SCAN1=OFF REMOTE_REGISTRATION_ADDRESS_LISTENER_SCAN2=OFF REMOTE_REGISTRATION_ADDRESS_LISTENER_SCAN3=OFF
SCAN listeners were restarted and services from all instances were visible in all SCAN listeners.
Explained very well.thank you for sharing
Quite an interesting and puzzling issue. Well documented and lucidly explained.