RAC Failure Scenario - Testing in our RAC Test DB - Node 1 & Node 2

Testing various RAC failure scenarios:

1 - Node Failure
2 - DB Instance Failure
3 - ASM Instance Failure
4 - Listener failure
5 - Public Network Failure
6 - Private Network Failure
7 - OCR & Voting Disk Failure
8 - ASM Disk Failure

==============
1 - Node Failure
==============
A node failure can be planned or unplanned, and it can affect a single node or all nodes.

Let us start the workload (a minimal sketch follows) & then shut down the node.
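
Any OLTP load generator works for this step; below is a minimal workload sketch. The credentials, connect string and the load_test table are placeholders for a scratch schema on the test database.

[oracle@test-rac1 ~]$ sqlplus test_user/test_pass@rac <<'EOF'
create table load_test (id number, stamp date);
begin
  -- simple insert loop to generate redo & ongoing transactions
  for i in 1 .. 100000 loop
    insert into load_test values (i, sysdate);
    commit;
  end loop;
end;
/
exit
EOF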

The expected result will be:
  • Resources will go offline
  • Instance recovery will be performed
  • Node VIP & SCAN VIP will fail over to the surviving node
  • SCAN listener will fail over to the surviving node
  • Client connections are moved to the surviving instance
Now shut down node 1 at the OS level – intentionally here, but the same applies if node 1 goes down for any other reason:

# shutdown -h now
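
While node 1 goes down, cluster membership can also be watched from node 2 with olsnodes (sketch; output illustrative):

[oracle@test-rac2 ~]$ olsnodes -n -s
test-rac1       1       Inactive
test-rac2       2       Active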

Check the cluster resource status from node 2:

[oracle@test-rac2 ~]$ crs_stat -t

Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac2
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac2
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac2
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac2
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac2
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac2
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2

Self Explanation:

As the output above shows, all resources are now running on node 2 only, because node 1 has failed.
To recover from the node 1 failure (we shut the node down ourselves), restart node 1. Once it is back up, the clusterware automatically performs all the startup operations, and the ASM & database instances are started.

After node 1 has restarted, verify the status again with crs_stat -t:

[oracle@test-rac1 ~]$ crs_stat -t

Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac1
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac1
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac1
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    ONLINE    test-rac1
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    ONLINE    test-rac1
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac1
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2
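
Besides crs_stat, a couple of srvctl checks confirm both instances are back (sketch; output illustrative, database name rac as used throughout this post):

[oracle@test-rac1 ~]$ srvctl status database -d rac
Instance rac1 is running on node test-rac1
Instance rac2 is running on node test-rac2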

For a better understanding, check the alert logs of node 1 & node 2.

Node 2 – alert log after node 1 is shut down:

SMON: enabling cache recovery
Sat Apr 08 12:30:02 2017
minact-scn: Inst 2 is now the master inc#:3 mmon proc-id:6037 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000
minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:3 errcnt:0
[6057] Successfully onlined Undo Tablespace 5.
Undo initialization finished serial:0 start:4294856940 end:4294858260 diff:1320 (13 seconds)
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is AL32UTF8
No Resource Manager plan active
Starting background process GTX0
Sat Apr 08 12:30:05 2017
GTX0 started with pid=35, OS id=6139
Starting background process RCBG
Sat Apr 08 12:30:06 2017
RCBG started with pid=36, OS id=6141
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Sat Apr 08 12:30:07 2017
QMNC started with pid=38, OS id=6145
Sat Apr 08 12:30:10 2017
Completed: ALTER DATABASE OPEN /* db agent *//* {1:25782:2} */
Sat Apr 08 12:30:12 2017
Starting background process CJQ0
Sat Apr 08 12:30:12 2017
CJQ0 started with pid=43, OS id=6169
Sat Apr 08 12:35:08 2017
Starting background process SMCO
Sat Apr 08 12:35:08 2017
SMCO started with pid=29, OS id=6669
Sat Apr 08 13:00:29 2017
Thread 2 advanced to log sequence 110 (LGWR switch)
  Current log# 4 seq# 110 mem# 0: +DATA/rac/onlinelog/group_4.267.937069189
Sat Apr 08 15:48:24 2017
Reconfiguration started (old inc 3, new inc 5)
List of instances:
 2 (myinst: 2)
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sat Apr 08 15:48:24 2017
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
Sat Apr 08 15:48:24 2017
minact-scn: master found reconf/inst-rec before recscn scan old-inc#:3 new-inc#:3
 Post SMON to start 1st pass IR
Sat Apr 08 15:48:24 2017
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Started redo scan
Completed redo scan
 read 56 KB redo, 32 data blocks need recovery
Started redo application at
 Thread 1: logseq 196, block 6940
Recovery of Online Redo Log: Thread 1 Group 2 Seq 196 Reading mem 0
  Mem# 0: +DATA/rac/onlinelog/group_2.262.937068817
Completed redo application of 0.02MB
Completed instance recovery at
 Thread 1: logseq 196, block 7053, scn 8509484
 31 data blocks read, 32 data blocks written, 56 redo k-bytes read
Thread 1 advanced to log sequence 197 (thread recovery)
minact-scn: master continuing after IR
minact-scn: Master considers inst:1 dead
Sat Apr 08 15:49:24 2017
Decreasing number of real time LMS from 1 to 0

Node 1 – alert log after the server is restarted:

[oracle@test-rac1 ~]$ cd /u01/app/oracle/diag/rdbms/rac/rac1/trace/
[oracle@test-rac1 trace]$ tail -100 alert_rac1.log
Sat Apr 08 16:01:44 2017
DBW0 started with pid=17, OS id=5925
Sat Apr 08 16:01:45 2017
LGWR started with pid=18, OS id=5928
Sat Apr 08 16:01:45 2017
CKPT started with pid=19, OS id=5930
Sat Apr 08 16:01:45 2017
SMON started with pid=20, OS id=5932
Sat Apr 08 16:01:45 2017
RECO started with pid=21, OS id=5934
Sat Apr 08 16:01:45 2017
RBAL started with pid=22, OS id=5937
Sat Apr 08 16:01:45 2017
ASMB started with pid=23, OS id=5939
Sat Apr 08 16:01:45 2017
MMON started with pid=24, OS id=5941
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
Sat Apr 08 16:01:45 2017
MMNL started with pid=25, OS id=5944
NOTE: initiating MARK startup
starting up 1 shared server(s) ...
Starting background process MARK
Sat Apr 08 16:01:45 2017
MARK started with pid=27, OS id=5949
NOTE: MARK has subscribed
lmon registered with NM - instance number 1 (internal mem no 0)
Reconfiguration started (old inc 0, new inc 11)
List of instances:
 1 2 (myinst: 1)
 Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
 Communication channels reestablished
 * domain 0 valid according to instance 2
 * domain 0 valid = 1 according to instance 2
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Sat Apr 08 16:01:46 2017
LCK0 started with pid=30, OS id=5957
Starting background process RSMN
Sat Apr 08 16:01:46 2017
RSMN started with pid=31, OS id=5959
ORACLE_BASE not set in environment. It is recommended
that ORACLE_BASE be set in the environment
Sat Apr 08 16:01:47 2017
ALTER SYSTEM SET local_listener=' (ADDRESS=(PROTOCOL=TCP)(HOST=10.20.0.92)(PORT=1521))' SCOPE=MEMORY SID='rac1';
ALTER DATABASE MOUNT /* db agent *//* {1:64177:5} */
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Loaded library: System
SUCCESS: diskgroup DATA was mounted
NOTE: dependency between database rac and diskgroup resource ora.DATA.dg is established
Successful mount of redo thread 1, with mount id 2528695933
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Lost write protection disabled
Completed: ALTER DATABASE MOUNT /* db agent *//* {1:64177:5} */
ALTER DATABASE OPEN /* db agent *//* {1:64177:5} */
Picked broadcast on commit scheme to generate SCNs
Thread 1 opened at log sequence 198
  Current log# 2 seq# 198 mem# 0: +DATA/rac/onlinelog/group_2.262.937068817
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
[5961] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:4294800940 end:4294801440 diff:500 (5 seconds)
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
Sat Apr 08 16:01:55 2017
SMON: enabling tx recovery
Database Characterset is AL32UTF8
No Resource Manager plan active
Starting background process GTX0
Sat Apr 08 16:01:55 2017
GTX0 started with pid=35, OS id=6006
Starting background process RCBG
Sat Apr 08 16:01:56 2017
RCBG started with pid=36, OS id=6011
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Sat Apr 08 16:01:56 2017
QMNC started with pid=37, OS id=6033
Sat Apr 08 16:01:58 2017
Completed: ALTER DATABASE OPEN /* db agent *//* {1:64177:5} */
Sat Apr 08 16:01:59 2017
minact-scn: Inst 1 is a slave inc#:11 mmon proc-id:5941 status:0x2
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000
Sat Apr 08 16:01:59 2017
Starting background process CJQ0
Sat Apr 08 16:02:00 2017
CJQ0 started with pid=44, OS id=6066
Sat Apr 08 16:06:58 2017
Starting background process SMCO
Sat Apr 08 16:06:58 2017
SMCO started with pid=29, OS id=6514
-------------------------------------------------------------------------END-------------------------------------------

=================
2 - DB Instance Failure
=================

Start the workload.

Shut down the instance (shutdown abort, or kill the PMON process).

Expected result:
  • Instance recovery will be performed
  • A surviving instance will read the online redo log files of the failed instance and ensure that committed transactions are recorded in the database
  • If several instances fail, one surviving instance will perform recovery for all of them
  • Services will be moved to an available instance
  • Client connections are moved to surviving instances
  • The failed instance will be restarted by the clusterware automatically
Node 1 – kill the PMON process intentionally (or do a shutdown abort):

[oracle@test-rac1 ~]$ ps -ef | grep pmon

oracle    5505     1  0 16:01 ?        00:00:00 asm_pmon_+ASM1
oracle    5867     1  0 16:01 ?        00:00:00 ora_pmon_rac1
oracle    7859  6007  0 16:22 pts/0    00:00:00 grep pmon

[oracle@test-rac1 ~]$ kill -9 5867

[oracle@test-rac1 ~]$  ps -ef | grep pmon

oracle    5505     1  0 16:01 ?        00:00:00 asm_pmon_+ASM1
oracle    7928  6007  0 16:23 pts/0    00:00:00 grep pmon

[oracle@test-rac1 ~]$ sqlplus / as sysdba

SQL> select instance_name, status from v$instance;

ORACLE not available

SQL> exit
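
The clusterware agent restarts the failed instance on its own; after a minute or two the following check should succeed (sketch; output illustrative):

[oracle@test-rac1 ~]$ srvctl status instance -d rac -i rac1
Instance rac1 is running on node test-rac1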

=======
Node 1 – database alert log after killing the PMON process; here the instance shuts down completely
=======

[oracle@test-rac1 trace]$ tail -100 alert_rac1.log

Sat Apr 08 16:25:49 2017
DBW0 started with pid=17, OS id=8407
Sat Apr 08 16:25:49 2017
LGWR started with pid=18, OS id=8409
Sat Apr 08 16:25:49 2017
CKPT started with pid=19, OS id=8411
Sat Apr 08 16:25:49 2017
SMON started with pid=20, OS id=8413
Sat Apr 08 16:25:49 2017
RECO started with pid=21, OS id=8415
Sat Apr 08 16:25:49 2017
RBAL started with pid=22, OS id=8417
Sat Apr 08 16:25:49 2017
ASMB started with pid=23, OS id=8419
Sat Apr 08 16:25:50 2017
MMON started with pid=24, OS id=8421
NOTE: initiating MARK startup
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
Starting background process MARK
Sat Apr 08 16:25:50 2017
MMNL started with pid=25, OS id=8425
Sat Apr 08 16:25:50 2017
MARK started with pid=26, OS id=8427
NOTE: MARK has subscribed
starting up 1 shared server(s) ...
lmon registered with NM - instance number 1 (internal mem no 0)
Reconfiguration started (old inc 0, new inc 19)
List of instances:
 1 2 (myinst: 1)
 Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
 Communication channels reestablished
 * domain 0 valid according to instance 2
 * domain 0 valid = 1 according to instance 2
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Sat Apr 08 16:25:51 2017
LCK0 started with pid=30, OS id=8439
Starting background process RSMN
Sat Apr 08 16:25:51 2017
RSMN started with pid=31, OS id=8441
ORACLE_BASE not set in environment. It is recommended
that ORACLE_BASE be set in the environment
Sat Apr 08 16:25:52 2017
ALTER SYSTEM SET local_listener=' (ADDRESS=(PROTOCOL=TCP)(HOST=10.20.0.92)(PORT=1521))' SCOPE=MEMORY SID='rac1';
ALTER DATABASE MOUNT /* db agent *//* {0:1:7} */
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Loaded library: System
SUCCESS: diskgroup DATA was mounted
NOTE: dependency between database rac and diskgroup resource ora.DATA.dg is established
Successful mount of redo thread 1, with mount id 2528695933
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Lost write protection disabled
Completed: ALTER DATABASE MOUNT /* db agent *//* {0:1:7} */
ALTER DATABASE OPEN /* db agent *//* {0:1:7} */
Picked broadcast on commit scheme to generate SCNs
Thread 1 opened at log sequence 200
  Current log# 2 seq# 200 mem# 0: +DATA/rac/onlinelog/group_2.262.937068817
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
minact-scn: Inst 1 is a slave inc#:19 mmon proc-id:8421 status:0x2
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000
[8443] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:1278034 end:1278424 diff:390 (3 seconds)
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is AL32UTF8
No Resource Manager plan active
Starting background process GTX0
Sat Apr 08 16:25:59 2017
GTX0 started with pid=35, OS id=8465
Starting background process RCBG
Sat Apr 08 16:26:00 2017
RCBG started with pid=36, OS id=8467
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Sat Apr 08 16:26:00 2017
QMNC started with pid=37, OS id=8469
Completed: ALTER DATABASE OPEN /* db agent *//* {0:1:7} */
Sat Apr 08 16:26:02 2017
Starting background process CJQ0
Sat Apr 08 16:26:02 2017
CJQ0 started with pid=42, OS id=8496
Sat Apr 08 16:27:27 2017
Shutting down instance (abort)
License high water mark = 4
USER (ospid: 8600): terminating the instance
Instance terminated by USER, pid = 8600
Sat Apr 08 16:27:28 2017
Instance shutdown complete

===========
Node 2
===========

Check the alert log on node 2 – the redo of the failed node 1 instance is recovered automatically:

[oracle@test-rac2 trace]$ tail -100 alert_rac2.log

Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
Started redo scan
 Fix write in gcs resources
Reconfiguration complete
Completed redo scan
 read 180 KB redo, 54 data blocks need recovery
Started redo application at
 Thread 1: logseq 203, block 107
Recovery of Online Redo Log: Thread 1 Group 1 Seq 203 Reading mem 0
  Mem# 0: +DATA/rac/onlinelog/group_1.261.937068817
Completed redo application of 0.04MB
Completed instance recovery at
 Thread 1: logseq 203, block 468, scn 8677383
 53 data blocks read, 54 data blocks written, 180 redo k-bytes read
Thread 1 advanced to log sequence 204 (thread recovery)
Sat Apr 08 16:44:08 2017
minact-scn: Master considers inst:1 dead
Sat Apr 08 16:45:06 2017
Decreasing number of real time LMS from 1 to 0
Sat Apr 08 16:47:32 2017
Reconfiguration started (old inc 33, new inc 35)
List of instances:
 1 2 (myinst: 2)
 Global Resource Directory frozen
 Communication channels reestablished
Sat Apr 08 16:47:32 2017
 * domain 0 valid = 1 according to instance 1
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sat Apr 08 16:47:32 2017
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Sat Apr 08 16:47:35 2017
minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:35 errcnt:0
Sat Apr 08 16:48:30 2017
Dumping diagnostic data in directory=[cdmp_20170408164830], requested by (instance=1, osid=10869 (LMD0)), summary=[abnormal instance termination].
Sat Apr 08 16:48:31 2017
Reconfiguration started (old inc 35, new inc 37)
List of instances:
 2 (myinst: 2)
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sat Apr 08 16:48:31 2017
 LMS 0: 1 GCS shadows cancelled, 1 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Sat Apr 08 16:48:31 2017
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Completed redo scan
 read 56 KB redo, 47 data blocks need recovery
Started redo application at
 Thread 1: logseq 204, block 2, scn 8677834
Recovery of Online Redo Log: Thread 1 Group 2 Seq 204 Reading mem 0
  Mem# 0: +DATA/rac/onlinelog/group_2.262.937068817
Completed redo application of 0.03MB
Completed instance recovery at
 Thread 1: logseq 204, block 115, scn 8699281
 35 data blocks read, 47 data blocks written, 56 redo k-bytes read
Thread 1 advanced to log sequence 205 (thread recovery)
Sat Apr 08 16:48:32 2017
minact-scn: Master considers inst:1 dead
Reconfiguration started (old inc 37, new inc 39)
List of instances:
 1 2 (myinst: 2)
 Global Resource Directory frozen
 Communication channels reestablished
Sat Apr 08 16:48:41 2017
 * domain 0 valid = 1 according to instance 1
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:39 errcnt:0
Sat Apr 08 16:49:49 2017
Increasing number of real time LMS from 0 to 1

--------------------------------------------------END-----------------------------------------------------

=======================
3 - ASM Instance Failure
=======================

Start the workload
  • Kill the PMON process of the ASM instance
Expected Result:
  • Similar to a DB instance failure, but here the ASM resource goes offline and is restarted automatically by the clusterware, and the database instance on that node is shut down abnormally
  • Instance recovery will be performed by reading the disk group log
  • Client connections are moved to surviving instances
  • Services will be moved to an available instance
Example:

[oracle@test-rac1 ~]$ ps -ef | grep pmon

oracle    5505     1  0 Apr08 ?        00:00:11 asm_pmon_+ASM1
oracle   11104     1  0 Apr08 ?        00:00:13 ora_pmon_rac1
oracle   19210 19169  0 11:49 pts/0    00:00:00 grep pmon

Kill the ASM PMON process as shown below:

[oracle@test-rac1 ~]$ kill -9 5505

Meanwhile, go to the ASM alert log location as shown below:

[oracle@test-rac1 trace]$ cd /u01/app/oracle/diag/asm/+asm/+ASM1/trace

[oracle@test-rac1 trace]$ pwd

/u01/app/oracle/diag/asm/+asm/+ASM1/trace

Check the alert log file on node 1:

[oracle@test-rac1 trace]$ tail -f alert_+ASM1.log

NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_9969.trc
Sat Apr 08 16:47:32 2017
NOTE: client rac1:rac registered, osid 10900, mbr 0x1
Sat Apr 08 16:48:31 2017
NOTE: ASM client rac1:rac disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_10900.trc
Sat Apr 08 16:48:40 2017
NOTE: client rac1:rac registered, osid 11154, mbr 0x1
Sun Apr 09 11:59:05 2017
LMON (ospid: 5521): terminating the instance due to error 472
Sun Apr 09 11:59:05 2017
System state dump requested by (instance=1, osid=5521 (LMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_diag_5515_20170409115905.trc
Dumping diagnostic data in directory=[cdmp_20170409115905], requested by (instance=1, osid=5521 (LMON)), summary=[abnormal instance termination].
Instance terminated by LMON, pid = 5521
Sun Apr 09 11:59:08 2017
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 1
Private Interface 'eth2:1' configured from GPnP for use as a private interconnect.
  [name='eth2:1', type=1, ip=169.254.35.129, mac=00-50-56-b0-38-83, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'eth1' configured from GPnP for use as a public interface.
  [name='eth1', type=1, ip=10.20.0.90, mac=00-50-56-b0-77-f1, net=10.20.0.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth1:1' configured from GPnP for use as a public interface.
  [name='eth1:1', type=1, ip=10.20.0.92, mac=00-50-56-b0-77-f1, net=10.20.0.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth1:2' configured from GPnP for use as a public interface.
  [name='eth1:2', type=1, ip=10.20.0.94, mac=00-50-56-b0-77-f1, net=10.20.0.0/24, mask=255.255.255.0, use=public/1]
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
    NUMA status: non-NUMA system
    cellaffinity.ora status: N/A
CELL communication will use 1 IP group(s):
    Grp 0:
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/11.2.0/grid/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
ORACLE_HOME = /u01/app/11.2.0/grid
System name:    Linux
Node name:      test-rac1.local
Release:        3.8.13-68.3.4.el6uek.x86_64
Version:        #2 SMP Tue Jul 14 15:03:36 PDT 2015
Machine:        x86_64
VM name:        VMWare Version: 6
Using parameter settings in server-side spfile +DATA/ractest-scan/asmparameterfile/registry.253.937065775
System parameters with non-default values:
  large_pool_size          = 12M
  instance_type            = "asm"
  remote_login_passwordfile= "EXCLUSIVE"
  asm_power_limit          = 1
  diagnostic_dest          = "/u01/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
  169.254.35.129
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Sun Apr 09 11:59:09 2017
PMON started with pid=2, OS id=20252
Sun Apr 09 11:59:09 2017
PSP0 started with pid=3, OS id=20254
Sun Apr 09 11:59:10 2017
VKTM started with pid=4, OS id=20256 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
Sun Apr 09 11:59:10 2017
GEN0 started with pid=5, OS id=20260
Sun Apr 09 11:59:10 2017
DIAG started with pid=6, OS id=20262
Sun Apr 09 11:59:10 2017
PING started with pid=7, OS id=20264
Sun Apr 09 11:59:10 2017
DIA0 started with pid=8, OS id=20266
Sun Apr 09 11:59:10 2017
LMON started with pid=9, OS id=20268
Sun Apr 09 11:59:10 2017
LMD0 started with pid=10, OS id=20270
* Load Monitor used for high load check
* New Low - High Load Threshold Range = [960 - 1280]
Sun Apr 09 11:59:10 2017
LMS0 started with pid=11, OS id=20272 at elevated priority
Sun Apr 09 11:59:10 2017
LMHB started with pid=12, OS id=20276
Sun Apr 09 11:59:10 2017
MMAN started with pid=13, OS id=20278
Sun Apr 09 11:59:10 2017
DBW0 started with pid=14, OS id=20280
Sun Apr 09 11:59:10 2017
LGWR started with pid=15, OS id=20282
Sun Apr 09 11:59:10 2017
CKPT started with pid=16, OS id=20284
Sun Apr 09 11:59:10 2017
SMON started with pid=17, OS id=20286
Sun Apr 09 11:59:10 2017
RBAL started with pid=18, OS id=20288
Sun Apr 09 11:59:10 2017
GMON started with pid=19, OS id=20290
Sun Apr 09 11:59:10 2017
MMON started with pid=20, OS id=20292
Sun Apr 09 11:59:10 2017
MMNL started with pid=21, OS id=20294
lmon registered with NM - instance number 1 (internal mem no 0)
Reconfiguration started (old inc 0, new inc 16)
ASM instance
List of instances:
 1 2 (myinst: 1)
 Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
 Communication channels reestablished
* allocate domain 1, invalid = TRUE
 * domain 0 valid = 1 according to instance 2
 * domain 1 valid = 1 according to instance 2
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Sun Apr 09 11:59:11 2017
LCK0 started with pid=22, OS id=20297
ORACLE_BASE not set in environment. It is recommended
that ORACLE_BASE be set in the environment
Sun Apr 09 11:59:12 2017
SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:9:5} */
NOTE: Diskgroup used for Voting files is:
         DATA
Diskgroup with spfile:DATA
Diskgroup used for OCR is:DATA
NOTE: cache registered group DATA number=1 incarn=0x1156f512
NOTE: cache began mount (not first) of group DATA number=1 incarn=0x1156f512
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Assigning number (1,0) to disk (ORCL:DISK1)
NOTE: Assigning number (1,1) to disk (ORCL:DISK2)
NOTE: Assigning number (1,2) to disk (ORCL:DISK3)
GMON querying group 1 at 2 for pid 23, osid 20303
NOTE: cache opening disk 0 of grp 1: DISK1 label:DISK1
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DISK2 label:DISK2
NOTE: cache opening disk 2 of grp 1: DISK3 label:DISK3
NOTE: cache mounting (not first) external redundancy group 1/0x1156F512 (DATA)
kjbdomatt send to inst 2
NOTE: attached to recovery domain 1
NOTE: redo buffer size is 256 blocks (1053184 bytes)
NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (DATA)
Process LGWR (pid 20282) is running at high priority QoS for Exadata I/O
NOTE: LGWR found thread 1 closed at ABA 9.741
NOTE: LGWR mounted thread 1 for diskgroup 1 (DATA)
NOTE: LGWR opening thread 1 at fcn 0.5550 ABA 10.742
NOTE: cache mounting group 1/0x1156F512 (DATA) succeeded
NOTE: cache ending mount (success) of group DATA number=1 incarn=0x1156f512
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup DATA was mounted
SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:9:5} */
SQL> ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:9:5} */
SUCCESS: ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:9:5} */
NOTE: diskgroup resource ora.DATA.dg is online
Sun Apr 09 11:59:13 2017
ALTER SYSTEM SET local_listener=' (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.20.0.92)(PORT=1521))))' SCOPE=MEMORY SID='+ASM1';
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: Refresh completed on diskgroup DATA
. Found 1 voting file(s).
NOTE: Voting file relocation is required in diskgroup DATA
NOTE: Attempting voting file relocation on diskgroup DATA
NOTE: Successful voting file relocation on diskgroup DATA
Sun Apr 09 11:59:17 2017
Starting background process ASMB
Sun Apr 09 11:59:17 2017
ASMB started with pid=26, OS id=20345
Sun Apr 09 11:59:17 2017
NOTE: client +ASM1:+ASM registered, osid 20347, mbr 0x0
Sun Apr 09 11:59:19 2017

NOTE: client rac1:rac registered, osid 20411, mbr 0x1

The recovery of the failed ASM instance is performed by node 2, as shown below in the node 2 alert log:

[oracle@test-rac2 trace]$ tail -f alert_+ASM2.log

 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Sun Apr 09 11:59:05 2017
Dumping diagnostic data in directory=[cdmp_20170409115905], requested by (instance=1, osid=5521 (LMON)), summary=[abnormal instance termination].
Sun Apr 09 11:59:07 2017
Reconfiguration started (old inc 12, new inc 14)
List of instances:
 2 (myinst: 2)
 Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sun Apr 09 11:59:07 2017
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Sun Apr 09 11:59:07 2017
NOTE: SMON starting instance recovery for group DATA domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: starting recovery of thread=1 ckpt=9.742 group=1 (DATA)
NOTE: SMON waiting for thread 1 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 1 (DATA)
NOTE: SMON successfully validated lock domain 1
NOTE: advancing ckpt for group 1 (DATA) thread=1 ckpt=9.742
NOTE: SMON did instance recovery for group DATA domain 1
Reconfiguration started (old inc 14, new inc 16)
List of instances:
1 2 (myinst: 2)
Global Resource Directory frozen
Communication channels reestablished
Sun Apr 09 11:59:11 2017
* domain 0 valid = 1 according to instance 1
* domain 1 valid = 1 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete

Hence, after the ASM instance failure, it is recovered automatically by node 2, and the ASM disk groups are mounted & the instance restarted automatically.
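
A quick srvctl check confirms ASM and the DATA disk group are back on both nodes (sketch; output illustrative):

[oracle@test-rac1 ~]$ srvctl status asm
ASM is running on test-rac1,test-rac2

[oracle@test-rac1 ~]$ srvctl status diskgroup -g DATA
Disk Group DATA is running on test-rac1,test-rac2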

===================END===============

=============================
4 - Local & SCAN Listener Failure
=============================

Kill the local listener process.

Kill the SCAN listener.

(See the sketch after the expected result below.)

Expected result:
  • New connections will be redirected to the surviving listener
  • The listener failure will be detected by the CRSD agent & the listener restarted automatically
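
No output was captured for this test; a minimal sketch of it follows (the tnslsnr PID and ps line are illustrative):

[oracle@test-rac1 ~]$ ps -ef | grep tnslsnr
oracle    4566     1  0 10:01 ?        00:00:01 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit

[oracle@test-rac1 ~]$ kill -9 4566

[oracle@test-rac1 ~]$ srvctl status listener
Listener LISTENER is enabled
Listener LISTENER is running on node(s): test-rac1,test-rac2

The same test applies to the SCAN listeners (check with srvctl status scan_listener).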
==================END==============

=======================
5 - Public Network Failure
=======================

Unplug the public network cable, or bring the public network interface down at the OS level.

Expected Result:
  • VIP & SCAN VIP will fail over to the surviving node
  • The DB instance will stay up & the DB service will fail over to the surviving node
  • If TAF is configured, clients should fail over to an available instance

Now we intentionally bring the public network IP down & observe the changes. The same failure can occur in other scenarios as well.

[root@test-rac1 ~]# /sbin/ip link set eth1 down

(Before this, verify which interface carries the public network – whether it is eth0, eth1, etc.)

Self Explanation:

You can check /etc/hosts and run ifconfig to verify exactly which interface the public network is configured on.
In this case, once the public IP is down, all client connections are disconnected & we cannot reach the server until the public network is brought up manually.
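
The registered role of each interface can also be read from the cluster registry with oifcfg (sketch; output illustrative – the private subnet shown here is an assumption):

[oracle@test-rac1 ~]$ oifcfg getif
eth1  10.20.0.0  global  public
eth2  192.168.1.0  global  cluster_interconnect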

To bring the interface back up:

[root@test-rac1 ~]# /sbin/ip link set eth1 up

Afterwards, the SCAN & VIP network services are redirected to node 2 automatically.
The database on node 1 remains open, but the listener is unreachable on node 1 only, whereas on node 2 the database & listener are up & running.
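
The VIP relocation can be confirmed from the surviving node (sketch; the VIP name is assumed to follow the default <node>-vip pattern):

[oracle@test-rac2 ~]$ srvctl status vip -n test-rac1
VIP test-rac1-vip is enabled
VIP test-rac1-vip is running on node: test-rac2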
Resource status before the network failure:

[oracle@test-rac1 trace]$ crs_stat -t

Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac1
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac1
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    ONLINE    test-rac1
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    ONLINE    test-rac1
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac1
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2

Resource status after the network failure:

[oracle@test-rac2 ~]$ crs_stat -t

Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac2
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac2
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    OFFLINE
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    OFFLINE
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac2
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2

===================END========================

=======================
6 - Private Network Failure
=======================

Unplug the private network cable, or bring the private network interface down at the OS level.

Expected Result:
  • A private network (interconnect) failure is one of the most critical failures in a RAC cluster
  • CSSD will detect a split-brain situation; the node with the lowest node number survives & the other node is evicted
  • The CRS, ASM & DB instances on the evicted node will shut down
  • All its processes will be terminated; if not, the node will be rebooted
  • After the interconnect is reconnected, the CRS stack & resources will be started
Self Explanation:

In this case the private network of node 1 is brought down; node 2 is rebooted automatically. Checking the resource status on node 1 shows that all the resources have failed over to node 1.

Resource status before the failure:

[oracle@test-rac1 trace]$ crs_stat -t

Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac1
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac1
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    ONLINE    test-rac1
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    ONLINE    test-rac1
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac1
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2

To bring the private interface down:

# /sbin/ip link set eth2 down

(Before this, verify which interface carries the private network – whether it is eth1, eth2, etc., as it differs per configuration.)

You can check /etc/hosts and run ifconfig to verify exactly which interface the private network is configured on; the oifcfg getif sketch shown in the previous section reveals the cluster_interconnect interface as well.

[oracle@test-rac2 ~]$ crs_stat -t

Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac2
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac2
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    test-rac2
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac2
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac2
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    OFFLINE
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    OFFLINE
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac2
ora....SM2.asm application    ONLINE    ONLINE    test-rac2
ora....C2.lsnr application    ONLINE    ONLINE    test-rac2
ora....ac2.gsd application    OFFLINE   OFFLINE
ora....ac2.ons application    ONLINE    ONLINE    test-rac2
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac2

Self Explanation:

When the node 1 private network goes down, the node 2 database instance & network resources fail; the node 1 database stays up but its listener is down. After the evicted server reboots, everything is started automatically.

[oracle@test-rac2 ~]$ crs_stat -t

CRS-0184: Cannot communicate with the CRS daemon.

SQL> select status from v$instance;
select status from v$instance
*
ERROR at line 1:
ORA-03135: connection lost contact
Process ID: 17034
Session ID: 50 Serial numbers: 2969
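
The eviction itself is recorded in the ocssd log on the surviving node (path based on this cluster's Grid home; the exact message text varies by version):

[oracle@test-rac1 ~]$ tail -f /u01/app/11.2.0/grid/log/test-rac1/cssd/ocssd.log | grep -i evict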

Self Explanation:

Whether the private network of the 1st node or of the 2nd node fails, in this two-node cluster Oracle always reboots the 2nd node (the surviving node is the one with the lowest node number).
If the private network of node 2 fails or goes down, node 2 itself is restarted. Once the server is up, bring the interface back up:

#/sbin/ip link set eth2 up

[oracle@test-rac1 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    test-rac1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N1.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N2.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora....N3.lsnr ora....er.type ONLINE    ONLINE    test-rac1
ora.asm        ora.asm.type   ONLINE    ONLINE    test-rac1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    test-rac1
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    test-rac1
ora.oc4j       ora.oc4j.type  ONLINE    OFFLINE
ora.ons        ora.ons.type   ONLINE    ONLINE    test-rac1
ora.rac.db     ora....se.type ONLINE    ONLINE    test-rac1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    test-rac1
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    test-rac1
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    test-rac1
ora....SM1.asm application    ONLINE    ONLINE    test-rac1
ora....C1.lsnr application    ONLINE    ONLINE    test-rac1
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    ONLINE    ONLINE    test-rac1
ora....ac1.vip ora....t1.type ONLINE    ONLINE    test-rac1
ora....ac2.vip ora....t1.type ONLINE    ONLINE    test-rac1
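
Once node 2 finishes rebooting and rejoins, a cluster-wide health check (sketch; output abbreviated & illustrative) should report the full stack online on both nodes:

[oracle@test-rac1 ~]$ crsctl check cluster -all
**************************************************************
test-rac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
test-rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************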
====================END======================
