Direct NFS: Failed to set socket buffer size.wtmax=[1048576] rtmax=[1048576], errno=-1 on Solaris

This error occurred on Solaris (SPARC) using Direct NFS (dNFS). The corresponding MOS note is incomplete and only recommends increasing the send and receive buffers; the complete solution is below:

 

/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 1056768
/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 1056768
/usr/sbin/ndd -set /dev/tcp tcp_max_buf 2097152

To check the values, use:

/usr/sbin/ndd /dev/tcp tcp_max_buf
/usr/sbin/ndd /dev/tcp tcp_xmit_hiwat
/usr/sbin/ndd /dev/tcp tcp_recv_hiwat
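
Note that ndd settings do not survive a reboot. To make them persistent, reapply them at boot time, for example from an init script. A minimal sketch for a Solaris 10 style system (the script name and location are just an example):

#!/bin/sh
# /etc/rc2.d/S99ndd-tcp -- reapply the TCP tuning at boot (example path)
# Set tcp_max_buf first: it caps the hiwat values, so raising it
# before the hiwat settings is the safe order.
/usr/sbin/ndd -set /dev/tcp tcp_max_buf 2097152
/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 1056768
/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 1056768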

Exadata outage due to ASM disks missing

On some older Exadata machines it can happen that the Grid Disks belonging to a cell are no longer found – and not even forcing the mount helps.
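
The disk header status per failgroup tells the story. A query along these lines (a sketch against the V$ASM_DISK view) produced the listing below:

select failgroup, header_status, count(*)
  from v$asm_disk
 group by failgroup, header_status
 order by failgroup;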

FAILGROUP                      HEADER_STATU   COUNT(*)
------------------------------ ------------ ----------
EXA01CEL02                     MEMBER               34
EXA01CEL03                     MEMBER               34
EXA01CEL04                     MEMBER               34
EXA01CEL05                     MEMBER               34
EXA01CEL06                     MEMBER               34
EXA01CEL07                     MEMBER               34
EXA01CEL08                     UNKNOWN              34
EXA01CEL09                     UNKNOWN              34
EXA01CEL10                     UNKNOWN              32
EXA01CEL11                     UNKNOWN              34
EXA01CEL12                     MEMBER               34
EXA01CEL13                     MEMBER               34
EXA01CEL14                     MEMBER               34



SQL> alter diskgroup DBFS_DG mount force;
alter diskgroup DBFS_DG mount force
*
ERROR at line 1:
ORA-15042: ASM disk "71" is missing from group number "1"
ORA-15042: ASM disk "70" is missing from group number "1"
ORA-15042: ASM disk "69" is missing from group number "1"
ORA-15042: ASM disk "68" is missing from group number "1"
ORA-15042: ASM disk "67" is missing from group number "1"
ORA-15042: ASM disk "66" is missing from group number "1"
ORA-15042: ASM disk "65" is missing from group number "1"
ORA-15042: ASM disk "64" is missing from group number "1"
ORA-15042: ASM disk "63" is missing from group number "1"
ORA-15042: ASM disk "62" is missing from group number "1"
ORA-15042: ASM disk "61" is missing from group number "1"
ORA-15042: ASM disk "60" is missing from group number "1"
ORA-15042: ASM disk "59" is missing from group number "1"
ORA-15042: ASM disk "58" is missing from group number "1"
ORA-15042: ASM disk "57" is missing from group number "1"
ORA-15042: ASM disk "56" is missing from group number "1"
ORA-15042: ASM disk "55" is missing from group number "1"
ORA-15042: ASM disk "54" is missing from group number "1"
ORA-15042: ASM disk "53" is missing from group number "1"
ORA-15042: ASM disk "52" is missing from group number "1"
ORA-15042: ASM disk "51" is missing from group number "1"
ORA-15042: ASM disk "50" is missing from group number "1"
ORA-15042: ASM disk "39" is missing from group number "1"
ORA-15042: ASM disk "38" is missing from group number "1"
ORA-15042: ASM disk "37" is missing from group number "1"
ORA-15042: ASM disk "36" is missing from group number "1"
ORA-15042: ASM disk "35" is missing from group number "1"
ORA-15042: ASM disk "34" is missing from group number "1"
ORA-15042: ASM disk "33" is missing from group number "1"
ORA-15042: ASM disk "32" is missing from group number "1"
ORA-15042: ASM disk "31" is missing from group number "1"
ORA-15042: ASM disk "30" is missing from group number "1"

This is fixed by the patch for the following bug:

16769943 EXADATA: SYSTEM OUTAGE - FOUR CELL NODES IN UNKNOWN STATUS TO ASM

ORA-29760: instance_number parameter not specified — even though INSTANCE_NUMBER is set

Oracle is complaining about:

ORA-29760: instance_number parameter not specified

You have triple-checked that the INSTANCE_NUMBER parameter is specified, but you still get the error.

The reason is that your ORACLE_SID contains an underscore, e.g. ORA_SB. Starting with Oracle 12c the underscore is a reserved character used to indicate a server pool in a RAC.
Simply do not use an underscore in the SID and the problem is gone.
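
For illustration (ORA_SB is just an example SID):

# Problematic: starting an instance with an underscore in the SID
# raises ORA-29760 on 12c
export ORACLE_SID=ORA_SB
# Fine: no underscore
export ORACLE_SID=ORASB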

ORA-00902: invalid datatype on MDSYS.SDO_GEORASTER during CREATE TABLE in 12c

Since I haven't found this anywhere on the web:

When you try to create a table with an MDSYS.SDO_GEORASTER column, you may receive an ORA-00902: invalid datatype:

SQL> @create_tab.sql
        "GEORASTER" "MDSYS"."SDO_GEORASTER" ,
                            *
ERROR at line 12:
ORA-00902: invalid datatype

Try granting EXECUTE permission on the data type to the user creating the table.
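
For example, assuming the table is to be created by the (hypothetical) user SCOTT:

SQL> grant execute on MDSYS.SDO_GEORASTER to scott;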

RAC refuses to start – or how installing TSM caused a cluster outage

Some days back I set up a new server for a customer of mine. Installation and patching ran fine, the cluster was stable – no issues.

Then suddenly the cluster started to act strange. Commands did not work anymore; after a reboot the cluster refused to start. The log files showed:

[root@node1 trace]# cat crsctl_60236.trc
Trace file /u01/app/oracle/diag/crs/node1/crs/trace/crsctl_60236.trc
Oracle Database 12c Clusterware Release 12.1.0.2.0 - Production Copyright 1996, 2014 Oracle. All rights reserved.
2014-11-18 18:55:52.765853 :  OCRMSG:2014938688: prom_waitconnect: CONN NOT ESTABLISHED (0,9,1,2)
2014-11-18 18:55:52.765870 :  OCRMSG:2014938688: GIPC error [9] msg [gipcretPermissions]
   CLWAL:2014938688: clsw_Initialize: Error [32] from procr_init_ext
  CLWAL:2014938688: clsw_Initialize: Error [PROCL-32: Oracle High Availability Services on the local node is not running Messaging error [gipcretPermissions] [9]] from procr_init_ext
2014-11-18 18:55:52.766123 :    GPNP:2014938688: clsgpnpkww_initclswcx: [at clsgpnpkww.c:351] Result: (56) CLSGPNP_OCR_INIT. (:GPNP01201:)Failed to init CLSW-OLR context. CLSW Error (3): CLSW-3: Error in the cluster registry (OCR) layer. [32] [PROCL-32: Oracle High Availability Services on the local node is not running Messaging error [gipcretPermissions] [9]]
2014-11-18 18:55:52.781979 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: [network]  failed connect attempt endp 0x139a810 [000000000000003e] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-00000000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_)(GIPCID=00000000-00000000-0))', numPend 0, numReady 1, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x136c6e0, sendp 0x136c4a0 status 13flags 0xa02c0712, flags-2 0x1, usrFlags 0x14000 }, req 0x13684c0 [0000000000000048] { gipcConnectRequest : addr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_)(GIPCID=00000000-00000000-0))', parentEndp 0x139a810, ret gipcretPermissions (9), objFlags 0x0, reqFlags 0x2 }
2014-11-18 18:55:52.781998 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos op  :  sgipcnDSConnectHelper
2014-11-18 18:55:52.782001 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos dep :  Permission denied (13)
2014-11-18 18:55:52.782003 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos loc :  connect
2014-11-18 18:55:52.782009 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos info:  failed to /var/tmp/.oracle/sOCSSD_LL_node1_
2014-11-18 18:55:52.782215 : CSSCLNT:2014938688: clsssConnect: gipc request failed with 9 (0x31)
2014-11-18 18:55:52.782274 : CSSCLNT:2014938688: clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_)) failed, rc 9
2014-11-18 18:55:52.782442 :  CRSCTL:2014938688: GetParser::getClustermode: crsctl_cssctx failed with 6

The interesting bit is the “permission denied” part. In most cases it points to a file system permission issue. As you can see from the trace, the affected directory is “/var/tmp/.oracle”, which (at least on an Exadata) is a link to “/tmp”.

After looking carefully I saw that the directory permissions for “/tmp” had been changed:

drwxr-xr--   12 root root      4096 Oct  8 23:56 tmp

Which is not right! Luckily the fix is easy – restore the default mode 1777 (world-writable with the sticky bit):

chmod 1777 /tmp

But what caused the issue? It turned out that one of the things I had done was installing TSM. I had extracted the TAR archive in /tmp and installed it from there. I tried to reproduce the issue and indeed succeeded: extracting the TAR file changed the permissions of the parent directory:

BEFORE:

drwxrwxrwt   12 root root      4096 Nov 19 12:24 tmp

AFTER:

drwxr-xr--   12 root root      4096 Oct  8 23:56 tmp
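
The likely mechanism: the archive contains an entry for the extraction directory itself (“./”), and when root extracts it with permissions preserved, tar applies that entry's mode to /tmp. A reproduction sketch (the staging path and archive name are made up):

mkdir /var/tmp/stage
chmod 754 /var/tmp/stage                        # rwxr-xr-- as seen above
tar -cf /var/tmp/tsm.tar -C /var/tmp/stage .    # records "./" with mode 754
ls -ld /tmp                                     # drwxrwxrwt -- before
cd /tmp && tar -xpf /var/tmp/tsm.tar            # as root: "." resets /tmp
ls -ld /tmp                                     # drwxr-xr-- -- sticky bit gone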

Great work, IBM!

Solaris Live Network Bandwidth Monitoring

When doing performance analysis the current network throughput is often of interest. The following nice script displays it – even without any root permissions.

#!/bin/sh
# usage: netvolmon DEV [INTERVAL]
# Prints RX/TX throughput in MB/s for a network device, sampled
# every INTERVAL seconds (default 5). Works without root.
DEV=$1
IVAL=${2:-5}

# Fetch the 64-bit receive/transmit byte counters for the device
# from the kernel statistics facility.
getrxtx() {
  kstat -p "*:*:$1:*bytes64" |
    awk '{print $2}'
}

rxtx=`getrxtx $DEV`
while sleep $IVAL; do
  nrxtx=`getrxtx $DEV`
  # Compute the deltas between the two samples and convert to MB/s
  (echo $IVAL $rxtx $nrxtx) |
    awk 'BEGIN { msg = "%6.2f MB/s RX %6.2f MB/s TX\n" }
         { rxd = ($4 - $2) / (1024*1024*$1);
           txd = ($5 - $3) / (1024*1024*$1);
           printf msg, rxd, txd }'
  rxtx="$nrxtx"
done

The script is taken from here: http://utcc.utoronto.ca/~cks/space/blog/solaris/SolarisNetworkBandwidth

It produces the following output, which is sufficient to get a quick overview of the current network traffic:

bash-3.2$ /export/home/oracle/net.sh igb0 5
  0.59 MB/s RX  20.54 MB/s TX
  1.17 MB/s RX  40.81 MB/s TX
  1.71 MB/s RX  59.72 MB/s TX
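
If you are unsure which interface name to pass to the script, dladm show-link lists the available datalinks, also without root permissions:

dladm show-link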

ORA-29275: partial multibyte character during RMAN resync

Today I had a small problem registering a newly created database in an RMAN catalog. The only unusual thing was that the database was using a Unicode character set.

RMAN failed with:

 
[oracle@rac02a ~]$ rman target / catalog rman/rman@rman
Recovery Manager: Release 11.2.0.3.0 - Production on Wed Feb 12 05:31:49 2014
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
connected to target database: ORA11 (DBID=781875003)
connected to recovery catalog database
RMAN> show all;
starting full resync of recovery catalog
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of show command at 02/12/2014 05:31:57
RMAN-03014: implicit resync of recovery catalog failed
RMAN-03009: failure of full resync command on default channel at 02/12/2014 05:31:57
ORA-29275: partial multibyte character

As you can see the database was using Unicode:

[oracle@rac02a ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.3.0 Production on Wed Feb 12 05:32:23 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL> select * from nls_database_parameters;
NLS_CHARACTERSET       AL32UTF8

So I changed the environment to match the character set of the database:

[oracle@rac02a ~]$ export NLS_LANG=GERMAN_GERMANY.AL32UTF8

And… Problem solved! (As a side effect, RMAN now prints its messages in German – proof that the new NLS_LANG setting is active.)

[oracle@rac02a ~]$ rman target / catalog rman/rman@rman
Recovery Manager: Release 11.2.0.3.0 - Production on Mi Feb 12 05:32:53 2014
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
Mit Ziel-Datenbank verbunden: ORA11 (DBID=781875003)
Verbindung mit Datenbank des Recovery-Katalogs
RMAN> show all;
Vollständige Neusynchronisation des Recovery-Katalogs wird begonnen
Vollständige Neusynchronisation abgeschlossen
RMAN-Konfigurationsparameter für Datenbank mit db_unique_name ORA11 sind:
CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default
CONFIGURE BACKUP OPTIMIZATION OFF; # default
CONFIGURE DEFAULT DEVICE TYPE TO DISK; # default
CONFIGURE CONTROLFILE AUTOBACKUP ON;
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '/nfs/backup/rman/ORA11/%F';
CONFIGURE DEVICE TYPE DISK PARALLELISM 1 BACKUP TYPE TO COPY;
CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE CHANNEL DEVICE TYPE DISK FORMAT   '/nfs/backup/rman/ORA11/%U';
CONFIGURE MAXSETSIZE TO UNLIMITED; # default
CONFIGURE ENCRYPTION FOR DATABASE OFF; # default
CONFIGURE ENCRYPTION ALGORITHM 'AES128'; # default
CONFIGURE COMPRESSION ALGORITHM 'BASIC' AS OF RELEASE 'DEFAULT' OPTIMIZE FOR LOAD TRUE ; # default
CONFIGURE ARCHIVELOG DELETION POLICY TO NONE; # default
CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/nfs/backup/rman/ORA11/backup/snapcf_ORA11.f';
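
As a side note: instead of guessing, a matching NLS_LANG can be derived directly from the database. A sketch (all three values live in NLS_DATABASE_PARAMETERS):

select 'NLS_LANG=' || l.value || '_' || t.value || '.' || c.value
  from nls_database_parameters l,
       nls_database_parameters t,
       nls_database_parameters c
 where l.parameter = 'NLS_LANGUAGE'
   and t.parameter = 'NLS_TERRITORY'
   and c.parameter = 'NLS_CHARACTERSET';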

Adding more than 512 LUNs to Oracle Enterprise Linux

When dealing with large databases it might be necessary to have more than 512 LUNs attached to the server. Most Linux distributions default to a maximum of 512 LUNs. This gets even worse with multipathing, where each path is represented as a separate LUN.

Error messages like this indicate the limit was hit:

Dec 02 22:24:46 server kernel: scsi: host 2 channel 0 id 2 lun258 has a LUN larger than allowed by the host adapter
Dec 02 22:24:46 server kernel: scsi: host 2 channel 0 id 2 lun259 has a LUN larger than allowed by the host adapter
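
On a 2.6 kernel the currently active limit can be checked through sysfs before changing anything:

cat /sys/module/scsi_mod/parameters/max_luns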

So what is needed to change that limit?

First you need to modify the maximum numbers of LUNs allowed by the SCSI stack by adding the following line to /etc/modprobe.conf:

options scsi_mod max_luns=2048

Depending on the SCSI HBA used, the driver module itself might need some adjustment. For Emulex HBAs the following line is required in /etc/modprobe.conf (QLogic HBAs use the qla2xxx driver, which has a corresponding parameter of its own):

options lpfc lpfc_max_luns=2048

After rebooting or removing and re-inserting the module you should be able to attach 2048 LUNs.
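
Note that on RHEL/OEL style systems scsi_mod and the HBA driver are typically loaded from the initrd, so the new options only take effect after rebuilding it and rebooting. A sketch for a RHEL/OEL 5 style system:

# Rebuild the initrd so the options from /etc/modprobe.conf are
# picked up at the next boot
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)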

By the way, the Linux device interface (including the multipath driver) supports more than 254 LUNs since kernel 2.6:

[root@ora1 ~]# vgs 
 VG     #PV #LV #SN Attr   VSize   VFree 
 vgroot   1 388   0 wz--n- 179,88G 94,50G

As you can see the VG has 388 LVs – more than 254… so we should actually run out of minor numbers in /dev, right? No: with kernel 2.6 the dev_t type was widened, giving 12 bits for major and 20 bits for minor numbers:

[root@ora1 mapper]# ls -la /dev/mapper/vgroot-disk00*
brw-rw---- 1 root disk 253,  6 24. Dez 05:27 /dev/mapper/vgroot-disk001
brw-rw---- 1 root disk 253,  7 24. Dez 05:27 /dev/mapper/vgroot-disk002
brw-rw---- 1 root disk 253,  8 24. Dez 05:27 /dev/mapper/vgroot-disk003
brw-rw---- 1 root disk 253, 308 24. Dez 05:27 /dev/mapper/vgroot-disk306
brw-rw---- 1 root disk 253, 309 24. Dez 05:27 /dev/mapper/vgroot-disk307
brw-rw---- 1 root disk 253, 310 24. Dez 05:27 /dev/mapper/vgroot-disk308
brw-rw---- 1 root disk 253, 311 24. Dez 05:27 /dev/mapper/vgroot-disk309

Copy permissions

Sometimes – especially on RACs – it is required to synchronize the permissions of an ORACLE_HOME on one node with those on another node.

 

Today I had the problem that after a patch all permissions were garbled. In order to restore service I had to fix them. The first idea here is to copy the permissions from a working node, and here is how to do it easily:

find /data/oragrid/product/11.2.0.3 -printf 'chown %U.%G %p\n' > /tmp/chown-home.sh
find /data/oragrid/product/11.2.0.3 -type d -printf 'chmod %m %p\n' > /tmp/chmod-home.sh
find /data/oragrid/product/11.2.0.3 -type f -printf 'chmod %m %p\n' >> /tmp/chmod-home.sh

Note the >> in the last command: the file permissions are appended to the same script as the directory permissions; a second > would silently overwrite them.

The generated chmod script looks like this:

chmod 775 /data/oragrid/product/11.2.0.3
chmod 755 /data/oragrid/product/11.2.0.3/patches
chmod 755 /data/oragrid/product/11.2.0.3/patches/p12989056
chmod 755 /data/oragrid/product/11.2.0.3/patches/p12989056/12989056
chmod 755 /data/oragrid/product/11.2.0.3/patches/p12989056/12989056/custom

and for the ownership:

chown 0.30275 /data/oragrid/product/11.2.0.3
chown 20341.30275 /data/oragrid/product/11.2.0.3/JRE
chown 20341.30275 /data/oragrid/product/11.2.0.3/patches
chown 20341.30275 /data/oragrid/product/11.2.0.3/patches/p12989056
chown 20341.30275 /data/oragrid/product/11.2.0.3/patches/p12989056/12989056

Just execute the scripts on the broken node in any order and you are done.
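
If the scripts were generated on the healthy node, copy them over and run them on the broken one (node2 and the root login are placeholders for your environment):

scp /tmp/chown-home.sh /tmp/chmod-home.sh node2:/tmp/
ssh root@node2 'sh /tmp/chown-home.sh; sh /tmp/chmod-home.sh'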