RAC refuses to start – or how installing TSM caused a cluster outage

Some days back i setup a new server for a customer of mine. Installation and patching ran fine, cluster was stable – no issues.

Then suddenly the cluster started to act strange. Commands did not work anymore; after a reboot the closure refused to start. The log files showed:

[root@node1 trace]# cat crsctl_60236.trc
Trace file /u01/app/oracle/diag/crs/node1/crs/trace/crsctl_60236.trc
Oracle Database 12c Clusterware Release 12.1.0.2.0 - Production Copyright 1996, 2014 Oracle. All rights reserved.
2014-11-18 18:55:52.765853 :  OCRMSG:2014938688: prom_waitconnect: CONN NOT ESTABLISHED (0,9,1,2)
2014-11-18 18:55:52.765870 :  OCRMSG:2014938688: GIPC error [9] msg [gipcretPermissions]
   CLWAL:2014938688: clsw_Initialize: Error [32] from procr_init_ext
  CLWAL:2014938688: clsw_Initialize: Error [PROCL-32: Oracle High Availability Services on the local node is not running Messaging error [gipcretPermissions] [9]] from procr_init_ext
2014-11-18 18:55:52.766123 :    GPNP:2014938688: clsgpnpkww_initclswcx: [at clsgpnpkww.c:351] Result: (56) CLSGPNP_OCR_INIT. (:GPNP01201:)Failed to init CLSW-OLR context. CLSW Error (3): CLSW-3: Error in the cluster registry (OCR) layer. [32] [PROCL-32: Oracle High Availability Services on the local node is not running Messaging error [gipcretPermissions] [9]]
2014-11-18 18:55:52.781979 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: [network]  failed connect attempt endp 0x139a810 [000000000000003e] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-00000000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_)(GIPCID=00000000-00000000-0))', numPend 0, numReady 1, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x136c6e0, sendp 0x136c4a0 status 13flags 0xa02c0712, flags-2 0x1, usrFlags 0x14000 }, req 0x13684c0 [0000000000000048] { gipcConnectRequest : addr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_)(GIPCID=00000000-00000000-0))', parentEndp 0x139a810, ret gipcretPermissions (9), objFlags 0x0, reqFlags 0x2 }
2014-11-18 18:55:52.781998 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos op  :  sgipcnDSConnectHelper
2014-11-18 18:55:52.782001 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos dep :  Permission denied (13)
2014-11-18 18:55:52.782003 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos loc :  connect
2014-11-18 18:55:52.782009 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos info:  failed to /var/tmp/.oracle/sOCSSD_LL_node1_
2014-11-18 18:55:52.782215 : CSSCLNT:2014938688: clsssConnect: gipc request failed with 9 (0x31)
2014-11-18 18:55:52.782274 : CSSCLNT:2014938688: clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_)) failed, rc 9
2014-11-18 18:55:52.782442 :  CRSCTL:2014938688: GetParser::getClustermode: crsctl_cssctx failed with 6

The interesting bit is the “permission denied” thing. It most cases it points to a file system permission issue. As you can see from the code the affected directory is “/var/tmp/.oracle” which (at least on an Exacta) is a link to “/tmp”.

After looking carefully i saw that the directory permission for “/tmp” got changed:

drwxr-xr--   12 root root      4096 Oct  8 23:56 tmp

Which is not right! Luckily the fix is easy:

chmod 1777 /tmp

But what caused the issue? It turned out one of the things i´ve done was to install TSM. I´ve extracted the TAR archive in /tmp and installed it from there. I tried to reproduce the issue and succeeded indeed. By extracting the TAR file the directory permissions of the parent directory was changed:

BEFORE:

drwxrwxrwt   12 root root      4096 Nov 19 12:24 tmp

AFTER:

drwxr-xr--   12 root root      4096 Oct  8 23:56 tmp

Great work, IBM!

This entry was posted in Oracle in general. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *