RAC Node Eviction after RFS processes being killed

Problem Statement

In a Real Application Clusters (RAC) configuration, a node eviction was observed after RFS processes were killed following "Possible network disconnect with primary database" errors.

Diagnosis
Excerpts from Database alert log

Going through the database alert log, a few of the RFS processes were killed due to "Possible network disconnect with primary database" errors, and media recovery was stuck.

 
 Killing 1 processes (PIDS:55848) (RFS ASYNC by thread) in order to receive ASYNC. Requested by OS process 57055 on instance 2 
 2020-02-06T19:24:44.791877+05:30
 RFS[14]: Assigned to RFS process (PID:57055)
 RFS[14]: Possible network disconnect with primary database
 2020-02-06T19:29:34.113157+05:30
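
To see which RFS and MRP processes were still around and what state they were in, the standby process states can be queried on the standby instance. This is a minimal sketch, assuming the oracle user's environment (ORACLE_HOME/ORACLE_SID) points at the standby; on 12.2 and later V$DATAGUARD_PROCESS can be used instead of V$MANAGED_STANDBY.

 
 # State of the redo transport (RFS) and apply (MRP) processes on the standby
 echo 'SELECT process, pid, status, thread#, sequence#, block# FROM v$managed_standby ORDER BY process;' | sqlplus -s / as sysdba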

Media recovery was restarted, but it did not progress.

 
 2020-02-07T13:15:16.772051+05:30
 ALTER DATABASE RECOVER  managed standby database cancel
 2020-02-07T13:15:16.776022+05:30
 MRP0: Background Media Recovery cancelled with status 16037
 2020-02-07T13:15:16.776362+05:30
 Errors in file /oradumps/demodb/diag/diag/rdbms/demodb/DEMODB1/trace/DEMODB1_pr00_14192.trc:
 ORA-16037: user requested cancel of managed recovery operation
 Completed: ALTER DATABASE RECOVER  managed standby database cancel
 2020-02-07T13:15:25.748077+05:30

 ALTER DATABASE RECOVER  managed standby database disconnect from session noparallel
 2020-02-07T13:15:25.757496+05:30

 Completed: ALTER DATABASE RECOVER  managed standby database disconnect from session noparallel 
 2020-02-07T13:22:12.078578+05:30
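
Whether redo apply was actually making progress can be cross-checked from the standby with the standard Data Guard statistics view. An illustrative check, again assuming a sysdba connection to the standby instance:

 
 # Apply and transport lag should keep shrinking if media recovery is progressing
 echo "SELECT name, value, time_computed FROM v\$dataguard_stats WHERE name IN ('apply lag', 'transport lag');" | sqlplus -s / as sysdba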

I/O errors were also seen in the database alert log, although no issue was reported at the storage layer.

 
 2020-02-07T13:36:33.923855+05:30
 WARNING: Write Failed. group:2 disk:3 AU:252925 offset:0 size:1048576
 path:/dev/ASMDISKS/DEMO_ARCH_ST1_04
 		 incarnation:0x691770af asynchronous result:'I/O error'
 		 subsys:System krq:0xffffffff7d4c1ed8 bufp:0xffffffff774cf000 osderr1:0x434c5344 osderr2:0x0
 		 IO elapsed time: 0 usec Time waited on I/O: 0 usec
 WARNING: Write Failed. group:2 disk:0 AU:252920 offset:0 size:1048576
 path:/dev/ASMDISKS/DEMO_ARCH_ST1_01
 		 incarnation:0x691770ac asynchronous result:'I/O error'
 		 subsys:System krq:0xffffffff7d044c88 bufp:0xffffffff7d046000 osderr1:0x434c5344 osderr2:0x0 
 		 IO elapsed time: 0 usec Time waited on I/O: 0 usec 
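
The warnings point at disk group 2 (the archive disk group). ASM's own view of those disks can be cross-checked with a query along these lines; an illustrative sketch, assuming it is run from the grid user's environment against the ASM instance:

 
 # ASM's view of the disks in the affected disk group (group 2 in the warnings above)
 echo 'SELECT group_number, disk_number, path, mount_status, mode_status, state FROM v$asm_disk WHERE group_number = 2 ORDER BY disk_number;' | sqlplus -s / as sysasm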

An instance shutdown was initiated, but that too hung, and the instance eventually had to be terminated.

 
 2020-02-07T14:12:13.248903+05:30
 Shutting down instance (immediate) (OS id: 2682)
 2020-02-07T14:12:15.835692+05:30
 
 2020-02-07T14:12:17.853374+05:30
 Errors in file /oradumps/demodb/diag/diag/rdbms/demodb/DEMODB1/trace/DEMODB1_pr1i_60842.trc:
 ORA-01089: immediate shutdown or close in progress - no operations are permitted
 
 USER (ospid: 5758): terminating the instance
 2020-02-07T14:22:19.181348+05:30
 Termination issued to instance processes. Waiting for the processes to exit, wait time 5 sec 
 2020-02-07T14:22:24.185841+05:30
 
 2020-02-07T14:22:35.196740+05:30
 Instance termination failed to kill one or more processes
 2020-02-07T14:22:35.197859+05:30
 Instance terminated by USER, pid = 5758 

An attempt to start the instance failed with error ORA-00304.

 
 Starting ORACLE instance (normal) (OS id: 6064)
 2020-02-07T14:23:41.750858+05:30
 CLI notifier numLatches:97 maxDescs:8307
 2020-02-07T14:23:41.753893+05:30
 
 Error: Shutdown in progress. Error: 304.
 USER (ospid: 6064): terminating the instance due to error 304
 2020-02-07T14:24:20.389612+05:30
 Instance terminated by USER, pid = 6064
 
 grid@demohost01:~$ oerr ora 304
 00304, 00000, "requested INSTANCE_NUMBER is busy"
 // *Cause:  An instance tried to start by using a value of the
 //          initialization parameter INSTANCE_NUMBER that is already in use. 
 // *Action: Either
 //       a) specify another INSTANCE_NUMBER,
 //       b) shut down the running instance with this number
 //       c) wait for instance recovery to complete on the instance with
 //          this number.
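
ORA-00304 here simply meant the crashed instance had not fully gone away. Leftover background processes or IPC segments can be spotted from the OS with checks like the following (an illustrative sketch, assuming the database runs as the oracle OS user):

 
 # Any surviving background processes of the crashed instance?
 ps -ef | grep 'ora_.*_DEMODB1' | grep -v grep
 
 # Any shared memory segments or semaphores still owned by the oracle user?
 ipcs -a | grep oracle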

Moments after this error surfaced, the server crashed.
We started investigating the reboot by looking at the clusterware logs.
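
In 12.2 Grid Infrastructure the clusterware alert log and daemon traces sit under the ADR trace directory, so the eviction-related messages can be pulled directly. A small sketch, using the ADR base seen in the traces below:

 
 # Clusterware alert log in the 12.2 ADR layout
 cd /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace
 egrep -n 'CRS-1608|CRS-1656|CRS-1661' alert.log | tail -20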

Excerpts from CRS alert log
 
 2020-02-07 14:24:57.921 [OCSSD(2720)]CRS-1608: This node was evicted by node 2, demohost02; details at (:CSSNM00005:) in /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace/ocssd.trc.
 2020-02-07 14:24:57.921 [OCSSD(2720)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace/ocssd.trc
 2020-02-07 14:24:57.924 [OCSSD(2720)]CRS-1652: Starting clean up of CRSD resources.
 2020-02-07 14:25:02.792 [OCSSD(2720)]CRS-1654: Clean up of CRSD resources finished successfully.
 2020-02-07 14:25:02.793 [OCSSD(2720)]CRS-1655: CSSD on node demohost01 detected a problem and started to shutdown.
 2020-02-07 14:25:02.798 [ORAROOTAGENT(6465)]CRS-5822: Agent '/GRIDHOME/ora12201/app/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:3:4} in /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace/crsd_orarootagent_root.trc. 
 2020-02-07 14:25:02.963 [CRSD(6572)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 6572
 2020-02-07 14:25:04.119 [CRSD(6572)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-23: Error in cluster services layer]. Details at (:CRSD00111:) in /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace/crsd.trc.
 .
 .
 2020-02-07 14:25:11.362 [CSSDMONITOR(2328)]CRS-1661: The CSS daemon is not responding. Reboot will occur in 13797 milliseconds; Details at (:CLSN00111:) in /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace/ohasd_cssdmonitor_root.trc
 2020-02-07 14:25:11.374 [CSSDAGENT(2718)]CRS-1661: The CSS daemon is not responding. Reboot will occur in 13762 milliseconds; Details at (:CLSN00111:) in /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace/ohasd_cssdagent_root.trc
 2020-02-07 14:25:12.026 [CRSD(6752)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-23: Error in cluster services layer]. Details at (:CRSD00111:) in /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace/crsd.trc.
 2020-02-07 14:25:12.178 [CRSD(6780)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 6780
 2020-02-07 14:25:13.340 [CRSD(6780)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-23: Error in cluster services layer]. Details at (:CRSD00111:) in /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace/crsd.trc.
 .
 .
 2020-02-07 14:25:15.983 [OHASD(1792)]CRS-2771: Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
 2020-02-07 14:25:18.271 [CSSDMONITOR(2328)]CRS-1661: The CSS daemon is not responding. Reboot will occur in 6896 milliseconds; Details at (:CLSN00111:) in /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace/ohasd_cssdmonitor_root.trc
 2020-02-07 14:25:18.277 [CSSDAGENT(2718)]CRS-1661: The CSS daemon is not responding. Reboot will occur in 6860 milliseconds; Details at (:CLSN00111:) in /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace/ohasd_cssdagent_root.trc

The CRS-1661 messages above indicate that the reboot was initiated by the cssdmonitor daemon.

Excerpts from ohasd_cssdmonitor_root.trc
 
 /GRIDHOME/app/grid/diag/crs/demohost01/crs/trace/ohasd_cssdmonitor_root.trc
 
 2020-02-07 14:24:26.322 : USRTHRD:18:  clsnkillagent_main:killreq received:
 
 2020-02-07 14:24:26.322 : USRTHRD:18:  clsskdKillMembers: kill status 0 pid 55848
 2020-02-07 14:24:26.322 : USRTHRD:18:  [clsuSlosFormatDiag called with non-error slos.]
 
 2020-02-07 14:24:26.322 : USRTHRD:18:  clsskdKillMembers: kill status 0 pid 4596
 2020-02-07 14:24:26.322 : USRTHRD:18:  [clsuSlosFormatDiag called with non-error slos.]
 
 2020-02-07 14:24:26.322 : USRTHRD:18:  clsskdKillMembers: kill status 0 pid 25827
 2020-02-07 14:24:26.322 : USRTHRD:18:  [clsuSlosFormatDiag called with non-error slos.]
 
 2020-02-07 14:24:26.322 : USRTHRD:18:  clsskdKillMembers: kill status 0 pid 32586
 2020-02-07 14:24:26.322 : USRTHRD:18:  [clsuSlosFormatDiag called with non-error slos.]
 
 2020-02-07 14:25:11.273 : USRTHRD:16:  (:CLSN00121:)clsnproc_reboot: Impending reboot at 50% of limit 27779; disk timeout 27779, network timeout 27501, 
 last heartbeat from CSSD at epoch seconds 1581065697.291, 13982 milliseconds ago based on invariant clock 130264115; now polling at 100 ms
 2020-02-07 14:25:18.174 : USRTHRD:16:  (:CLSN00121:)clsnproc_reboot: Impending reboot at 75% of limit 27779; disk timeout 27779, network timeout 27501, 
 last heartbeat from CSSD at epoch seconds 1581065697.291, 20883 milliseconds ago based on invariant clock 130264115; now polling at 100 ms
 2020-02-07 14:25:22.375 : USRTHRD:16:  (:CLSN00121:)clsnproc_reboot: Impending reboot at 90% of limit 27779; disk timeout 27779, network timeout 27501,  
 last heartbeat from CSSD at epoch seconds 1581065697.291, 25084 milliseconds ago based on invariant clock 130264115; now polling at 100 ms

After the database crashed, process cleanup (KILL) was initiated by the cluster, as can be seen in the ohasd_cssdmonitor_root.trc trace file.
All of the killed PIDs were database RFS processes.

 
 $ cd /oradumps/demodb/diag/diag/rdbms/demodb/DEMODB1/trace/
 $ egrep -i '55848|4596|25827|32586|17557|28049|32362|64698' alert_DEMODB1.log | grep ASYNC 
 Killing 1 processes (PIDS:55848) (RFS ASYNC by thread) in order to receive ASYNC. Requested by OS process 57055 on instance 2
 Killing 1 processes (PIDS:4596)  (RFS ASYNC by thread) in order to receive ASYNC. Requested by OS process 6173  on instance 2
 Killing 1 processes (PIDS:25827) (RFS ASYNC by thread) in order to receive ASYNC. Requested by OS process 27713 on instance 2
 Killing 1 processes (PIDS:32586) (RFS ASYNC by thread) in order to receive ASYNC. Requested by OS process 35056 on instance 2

Looking at the server messages file, the panic was initiated at Feb 7 14:25:25.
Crash dump /var/crash/data/708ca07a-331a-4e8c-8256-a1cf5a2f24f7 was generated automatically.
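
The panic and savecore entries can be pulled straight from the messages file (assuming the default Solaris location):

 
 # Panic and savecore entries around the reboot
 egrep -i 'panic|savecore' /var/adm/messages | tail -40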

 
 Feb  7 14:25:25 demohost01 ^Mpanic[cpu37]/thread=18400d0f05500:
 Feb  7 14:25:25 demohost01 unix: [ID 493256 kern.notice] Oracle RAC node eviction
 Feb  7 14:25:25 demohost01 unix: [ID 100000 kern.notice]
 
 Feb  7 14:26:24 demohost01 savecore: [ID 570001 daemon.notice] reboot after panic: Oracle RAC node eviction
 Feb  7 14:26:24 demohost01 savecore: [ID 702911 daemon.warning] System dump time: Fri Feb  7 14:25:26 2020
 Feb  7 14:26:24 demohost01 savecore: [ID 702911 daemon.warning] Dump found on /dev/defdump
 Feb  7 14:26:25 demohost01 savecore: [ID 702911 daemon.error] System startup delayed until crashdump is saved to disk
 Feb  7 14:26:25 demohost01 savecore: [ID 702911 daemon.error] Saving compressed system crash dump files in directory /var/crash/data/708ca07a-331a-4e8c-8256-a1cf5a2f24f7
 
 Feb  7 14:26:49 demohost01 savecore: [ID 702911 daemon.error] Decompress all crash dump files with '(cd /var/crash/data/708ca07a-331a-4e8c-8256-a1cf5a2f24f7 && savecore 21)' 
                 or individual files with 'savecore -f /var/crash/data/708ca07a-331a-4e8c-8256-a1cf5a2f24f7/vmdump{,-<secname>}.21' 
 Feb  7 14:26:49 demohost01 fmd: [ID 377184 daemon.notice] SUNW-MSG-ID: SAVECORE-8000-1M, TYPE: Alert, VER: 1, SEVERITY: Minor
 Feb  7 14:26:49 demohost01 EVENT-TIME: Fri Feb  7 14:26:49 IST 2020
 Feb  7 14:26:49 demohost01 PLATFORM: SPARC-M8-8, CSN: unknown, HOSTNAME: demohost01
 Feb  7 14:26:49 demohost01 SOURCE: software-diagnosis, REV: 0.2
 Feb  7 14:26:49 demohost01 EVENT-ID: c2e3d5a1-a0cb-40fc-9981-8387b2b5c76d
 Feb  7 14:26:49 demohost01 DESC: savecore(8) successfully extracted crash dump from dump device with OS image UUID 708ca07a-331a-4e8c-8256-a1cf5a2f24f7 
                 into directory /var/crash/data/708ca07a-331a-4e8c-8256-a1cf5a2f24f7

Reading Solaris Crash dump using SCAT

We started investigating the crash dump using the SCAT utility.
Reference: Oracle Solaris Crash Analysis Tool (SCAT) (Doc ID 1381679.2)

 
 cd /var/crash/data/708ca07a-331a-4e8c-8256-a1cf5a2f24f7 
 savecore -vf vmdump.21

 
 root@demohost01:/var/crash/data/708ca07a-331a-4e8c-8256-a1cf5a2f24f7# scat 21
 
   Oracle Solaris Crash Analysis Tool
   Version 5.5.1beta for Oracle Solaris 11 64-bit UltraSPARC
 
   Copyright (c) 1989, 2020, Oracle and/or its affiliates. All rights reserved.
 
   Please note: Do not submit any health, payment card or other sensitive
   production data that requires protections greater than those specified in
   the Oracle GCS Security Practices.  Information on how to remove data from
   your submission is available at:
 	https://support.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1227943.1 
 
   For support, please use the Oracle Solaris kernel community at
   https://community.oracle.com/community/support/oracle_sun_technologies/
   Select "Subspaces" and then "Oracle Solaris Performance, Panics,
   Hangs, and Dtrace".
   Further information may be found at https://blogs.oracle.com/SolarisCAT/
 
 opening vmcore.21 ...dumphdr...
 NOTE: Not all coredump section files loaded - 1 missing.
 symtab...maps...done
 loading crashdump data: modules...CTF...globals...done
 
 crash file:     /var/crash/data/708ca07a-331a-4e8c-8256-a1cf5a2f24f7/vmcore.21
 user:           Super-User (root:0)
 release:        5.11 (64-bit)
 version:        11.4.13.4.0
  usr/src:       38757:9c92ed95cfb3:11.4.13.0.1.4.0+0
  usr/closed:    2485:8ad50161d8ee:11.4.13.0.1.4.0+0
 machine:        sun4v
 node name:      demohost01
 hw_provider:    Oracle Corporation
 system type:    ORCL,SPARC-M8-8 (SPARC-M8)
 hostid:         84f911c0
 dump_conflags:  0x10100 (DUMP_KERNEL|DUMP_ZFS) on /dev/zvol/dsk/rpool/dump(1M)
 dump_uuid:      708ca07a-331a-4e8c-8256-a1cf5a2f24f7
 time of crash:  Fri Feb  7 14:25:25 IST 2020
 age of system:  22 hours 3 minutes 43 seconds
 panic CPU:      37 (128 CPUs, 256G memory, 17 nodes)
 panic string:   Oracle RAC node eviction
 
 sanity checks: settings...vmem...CPU...sysent...clock...misc...
 NOTE: system has 2 non-default projects
 done

Panic initiated by PID 2328 (cssdmonitor)

 
 CAT(vmcore.21/11V)> panic
 panic on CPU 37
 panic string:   Oracle RAC node eviction
 ==== panic user (LWP_SYS) thread: 0x18400d0f05500  PID: 2328  on CPU: 37
 cmd: /GRIDHOME/ora12201/app/grid/bin/cssdmonitor
 fmri: lrc:/etc/rc3_d/S96ohasd
 t_procp: 0x18400e71be3a8
    p_as: 0x18400e645ee78  size: 184508416  RSS: 122634240
       a_hat: 0x18400e5ec05c0
       cnum: CPU0:6/102 CPU24:4/98 CPU40:2/198 CPU56:3/111 CPU80:4/105 CPU112:3/51
       cpusran: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
               27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,
               51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,
               75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,
               99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,
               117,118,119,120,121,122,123,124,125,126,127
    p_zone: 0x209018c0 (global)
 t_stk: 0x2a112b5ba80  sp: 0x2a112b5b031  t_stkbase: 0x2a112b54000
 t_pri: 110 (RT)  pctcpu: 0.008073
 t_transience: 10 (TRANSIENT)
 t_lwp: 0x18400d29c4d40  t_tid: 16
    machpcb: 0x2a112b5ba80
    lwp_ap:   0x2a112b5bb70
    t_mstate: LMS_SYSTEM  ms_prev: LMS_USER
    ms_state_start: 0.000148530 seconds earlier
    ms_start: 22 hours 2 minutes 14.838736230 seconds earlier
 t_cpupart: 0x20551a70(0)  last CPU: 37
 idle: 207750 nsec (0.000207750s)
 start: Thu Feb  6 16:23:10 2020
 age: 79335 seconds (22 hours 2 minutes 15 seconds)
 t_state:     TS_ONPROC
 t_flag:      0x1800(T_PANIC|T_LWPREUSE)
 t_proc_flag: 0x100 (TP_MSACCT)
 t_schedflag: 0x13 (TS_LOAD|TS_DONT_SWAP|TS_SIGNALLED)
 t_acflag:    3 (TA_NO_PROCESS_LOCK|TA_BATCH_TICKS)
 p_flag:      0x4a004000 (SEXECED|SMSACCT|SAUTOLPG|SMSFORK)
 
 pc:      unix:panic+0x1c:   call        unix:vpanic
 
 void unix:panic+0x1c((const char *)0x10818aa8, 0x10818800, 0x10818800, 0x20885800, 0x18400da864d58, 0x10818800, ...)
 int genunix:kadmin+0x6cc((int)0x2012d400?, (int)1, (void *)0, (cred_t *)0x18400b7300a98)
 int genunix:uadmin+0x1bc((int)5, (int)1, (uintptr_t))
 unix:_syscall_no_proc_exit+0x5c()
 -- switch to user thread's user stack --

Thread summary from crash dump

 
 CAT(vmcore.21/11V)> thread summary
 		reference clock = panic_lbolt: 0x7930b0, panic_hrtime: 0x5c03fff26e8da2 
      429   threads ran since 1 second before current tick (129 user, 300 kernel)
     3187   threads ran since 1 minute before current tick (533 user, 2654 kernel)
 
 	1   TS_RUN threads (1 user, 0 kernel)
 	1   TS_STOPPED threads (0 user, 1 kernel)
 	0   !TS_LOAD (swapped) threads
 
 	0   threads trying to get a mutex
 	0   threads trying to get an rwlock
     5392   threads waiting for a condition variable (1068 user, 4324 kernel)
 	0   threads sleeping on a semaphore
      187   threads sleeping on a user-level sobj (187 user, 0 kernel)
 	0   threads waiting for a UPI mutex
      125   threads sleeping on a shuttle (door) (124 user, 1 kernel)
 
 	0   threads in biowait()
 	8*  procs with SIGKILL posted (see "tlist killed")
       16*  threads with procs with SIGKILL posted (16 user, 0 kernel)
       40   threads in vdc_recv() (0 user, 40 kernel)
 
 	1   threads in dispatch queues (1 user, 0 kernel)
 	1*  threads in dispq of CPU running idle thread (1 user, 0 kernel)

As shown in the summary above, there were 8 procs with SIGKILL posted.
Details can be listed using the tlist killed command.

KILLED process stack – PID 55848

 
 CAT(vmcore.21/11V)> tlist killed
 ==== user (LWP_SYS) thread: 0x184011486d840  PID: 55848
 cmd: oracleDEMODB1 (LOCAL=NO)
 fmri: svc:/network/ssh:default
 t_wchan: 0x184010b23a434  sobj: condition var (from genunix:closeandsetf_flags+0x314)
 t_procp: 0x184010ffe8330
    p_as: 0x184010910a2e0  size: 77606789120  RSS: 50995200
 	  a_hat: 0x1840108f64300
 	  cnum: CPU0:6/2949 CPU24:4/32335 CPU40:2/63702 CPU56:1/38089 CPU80:1/48277 CPU112:1/27649
 	  cpusran: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,
                   30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,
                   56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,
                   82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,
                   106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,
                   126,127
    p_zone: 0x209018c0 (global)
 t_stk: 0x2a100d13a80  sp: 0x2a100d13051  t_stkbase: 0x2a100d0c000
 t_pri: 59 (TS)  pctcpu: 0.000000
 t_transience: 0
 t_lwp: 0x18401118fce70  t_tid: 1
    machpcb: 0x2a100d13a80
    lwp_ap:   0x18401118fcf28
    t_mstate: LMS_SLEEP  ms_prev: LMS_SYSTEM
    ms_state_start: 19 hours 1 minutes 43.509349620 seconds earlier
    ms_start: 19 hours 8 minutes 33.010578345 seconds earlier
 t_cpupart: 0x20551a70(0)  last CPU: 43
 idle: 68503509344805 nsec (19h1m43.509344805s)
 start: Thu Feb  6 19:16:52 2020
 age: 68913 seconds (19 hours 8 minutes 33 seconds)
 t_state:     TS_SLEEP
 t_flag:      0x1000(T_LWPREUSE)
 t_proc_flag: 0x104 (TP_TWAIT|TP_MSACCT)
 t_schedflag: 3 (TS_LOAD|TS_DONT_SWAP)
 t_acflag:    3 (TA_NO_PROCESS_LOCK|TA_BATCH_TICKS)
 p_flag:      0x4a304902 (SEXITING|SKILLED|SEXTKILLED|SEXECED|SEXITLWPS|SHOLDFORK|SMSACCT|SAUTOLPG|SMSFORK) 
 
 pc:      genunix:cv_wait+0x64:   call   unix:swtch
 
 void genunix:cv_wait+0x64((kcondvar_t *), (kmutex_t *)0x184010b23a400)
 int genunix:closeandsetf_flags+0x314((int)0x10, (file_t *)0, (char)0)
 int genunix:close+0xc((int))
 unix:_syscall_no_proc_exit+0x5c()
 -- switch to user thread's user stack --

The kill could not complete: the exiting process was stuck in closeandsetf_flags+0x314.

So, what led to this?

We searched the available support notes for the OS stack closeandsetf_flags+0x314 and found a match.

On systems running Solaris 11.4.10.3.0 or later, processes that use the poll(2) system call on BSD sockets may hang on exit waiting for a reference count to drop.
The stack trace will be similar to the following:

void genunix:cv_wait()
int genunix:closeandsetf_flags()
int genunix:close()
unix:_syscall_no_proc_exit()
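
This matched our environment: the SCAT output above shows the host on 11.4.13.4.0, which falls in the affected range. The installed SRU level can also be confirmed directly with the packaging tools:

 
 # Confirm the installed Solaris SRU level (affected: 11.4.10.3.0 and later)
 pkg list entire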

Reference Document
Solaris 11.4 Processes That Use the poll(2) systemcall on BSD Sockets may Hang on Exit (Doc ID 2600420.1)

Solution Implemented

This issue is addressed in SRU release Solaris 11.4.15.5.0 and later.
We applied SRU release 11.4.17.3.0, and the node eviction issue was resolved.

 
 pkg info entire
              Name: entire
           Summary: entire incorporation including Support Repository Update
                    (Oracle Solaris 11.4.17.3.0).
       Description: This package constrains system package versions to the same 
                    build.  WARNING: Proper system update and correct package
                    selection depend on the presence of this incorporation.
                    Removing this package will result in an unsupported system.
                    For more information see:
                    https://support.oracle.com/rs?type=doc&id=2433412.1
          Category: Meta Packages/Incorporations
             State: Installed
         Publisher: solaris
           Version: 11.4 (Oracle Solaris 11.4.17.3.0)
            Branch: 11.4.17.0.1.3.0
    Packaging Date: Fri Dec 20 20:23:47 2019
 Last Install Time: Fri Dec 13 10:35:54 2019
  Last Update Time: Wed Feb 12 09:34:45 2020
              Size: 2.52 kB
              FMRI: pkg://solaris/[email protected]:20191220T202347Z
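
For completeness, an SRU is typically applied by updating the entire incorporation from the configured support repository and booting into the new boot environment. An illustrative sketch only; the exact repository setup and target version depend on the environment:

 
 # Update to the latest SRU available in the configured repository (creates a new boot environment)
 pkg update --accept entire
 # List boot environments and reboot into the newly created one
 beadm list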

 
