Direct NFS: Failed to set socket buffer size.wtmax=[1048576] rtmax=[1048576], errno=-1 on Solaris

This error occured on Solaris (SPARC) using Direct NFS (dNFS). The MOS Note is incomplete and only recommends to increase the send and receive buffers, but the solution is below:

 

/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 1056768
/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 1056768
/usr/sbin/ndd -set /dev/tcp tcp_max_buf 2097152

To check the values, use:

/usr/sbin/ndd /dev/tcp tcp_max_buf
/usr/sbin/ndd /dev/tcp tcp_xmit_hiwat
/usr/sbin/ndd /dev/tcp tcp_recv_hiwat

Exadata outage due to ASM disks missing

On some older Exadatas it can be that Grid Disks belonging to a cell are not found anymore. And not even FORCEing the moutn helps:

FAILGROUP HEADER_STATU COUNT(*)
------------------------------ ------------ ----------
EXA01CEL02 MEMBER 34
EXA01CEL03 MEMBER 34
EXA01CEL04 MEMBER 34
EXA01CEL05 MEMBER 34
EXA01CEL06 MEMBER 34
EXA01CEL07 MEMBER 34
EXA01CEL08 UNKNOWN 34
EXA01CEL09 UNKNOWN 34
EXA01CEL10 UNKNOWN 32
EXA01CEL11 UNKNOWN 34
EXA01CEL12 MEMBER 34
EXA01CEL13 MEMBER 34
EXA01CEL14 MEMBER 34



SQL> alter diskgroup DBFS_DG mount force;
alter diskgroup DBFS_DG mount force
*
ERROR at line 1:
ORA-15042: ASM disk "71" is missing from group number "1"
ORA-15042: ASM disk "70" is missing from group number "1"
ORA-15042: ASM disk "69" is missing from group number "1"
ORA-15042: ASM disk "68" is missing from group number "1"
ORA-15042: ASM disk "67" is missing from group number "1"
ORA-15042: ASM disk "66" is missing from group number "1"
ORA-15042: ASM disk "65" is missing from group number "1"
ORA-15042: ASM disk "64" is missing from group number "1"
ORA-15042: ASM disk "63" is missing from group number "1"
ORA-15042: ASM disk "62" is missing from group number "1"
ORA-15042: ASM disk "61" is missing from group number "1"
ORA-15042: ASM disk "60" is missing from group number "1"
ORA-15042: ASM disk "59" is missing from group number "1"
ORA-15042: ASM disk "58" is missing from group number "1"
ORA-15042: ASM disk "57" is missing from group number "1"
ORA-15042: ASM disk "56" is missing from group number "1"
ORA-15042: ASM disk "55" is missing from group number "1"
ORA-15042: ASM disk "54" is missing from group number "1"
ORA-15042: ASM disk "53" is missing from group number "1"
ORA-15042: ASM disk "52" is missing from group number "1"
ORA-15042: ASM disk "51" is missing from group number "1"
ORA-15042: ASM disk "50" is missing from group number "1"
ORA-15042: ASM disk "39" is missing from group number "1"
ORA-15042: ASM disk "38" is missing from group number "1"
ORA-15042: ASM disk "37" is missing from group number "1"
ORA-15042: ASM disk "36" is missing from group number "1"
ORA-15042: ASM disk "35" is missing from group number "1"
ORA-15042: ASM disk "34" is missing from group number "1"
ORA-15042: ASM disk "33" is missing from group number "1"
ORA-15042: ASM disk "32" is missing from group number "1"
ORA-15042: ASM disk "31" is missing from group number "1"
ORA-15042: ASM disk "30" is missing from group number "1"

This is fixed with Bug:

16769943 EXADATA: SYSTEM OUTAGE - FOUR CELL NODES IN UNKNOWN STATUS TO ASM

ORA-29760: instance_number parameter not specified — without INSTANCE_NUMBER set

Oracle is complaining about:

ORA-29760: instance_number parameter not specified

And you have triple checked that the INSTANCE_NUMBER parameter is specified, but you still get the error.

The reason for that is, that your ORACLE_SID contains a underscore, e.g. ORA_SB. Starting with Oracle 12c the *underscore* parameter is a reserved character to indicate a server pool in a RAC.
Blase do not use the underscore parameter in the SID and the problem is gone.

ORA-00902: invalid datatype on MDSYS.SDO_GEORASTER during CREATE TABLE in 12c

Since i haven´t found this on the web:

When you try to create a table with MDSYS.SGO_GEORASTER and you´re receiving an ORA-00902: invalid datapye try the following:

SQL> @create_tab.sql
        "GEORASTER" "MDSYS"."SDO_GEORASTER" ,
                            *
ERROR at line 12:
ORA-00902: invalid datatype

Try to grant EXECUTE permissions on the data type to the user creating the table.

RAC refuses to start – or how installing TSM caused a cluster outage

Some days back i setup a new server for a customer of mine. Installation and patching ran fine, cluster was stable – no issues.

Then suddenly the cluster started to act strange. Commands did not work anymore; after a reboot the closure refused to start. The log files showed:

[root@node1 trace]# cat crsctl_60236.trc
Trace file /u01/app/oracle/diag/crs/node1/crs/trace/crsctl_60236.trc
Oracle Database 12c Clusterware Release 12.1.0.2.0 - Production Copyright 1996, 2014 Oracle. All rights reserved.
2014-11-18 18:55:52.765853 :  OCRMSG:2014938688: prom_waitconnect: CONN NOT ESTABLISHED (0,9,1,2)
2014-11-18 18:55:52.765870 :  OCRMSG:2014938688: GIPC error [9] msg [gipcretPermissions]
   CLWAL:2014938688: clsw_Initialize: Error [32] from procr_init_ext
  CLWAL:2014938688: clsw_Initialize: Error [PROCL-32: Oracle High Availability Services on the local node is not running Messaging error [gipcretPermissions] [9]] from procr_init_ext
2014-11-18 18:55:52.766123 :    GPNP:2014938688: clsgpnpkww_initclswcx: [at clsgpnpkww.c:351] Result: (56) CLSGPNP_OCR_INIT. (:GPNP01201:)Failed to init CLSW-OLR context. CLSW Error (3): CLSW-3: Error in the cluster registry (OCR) layer. [32] [PROCL-32: Oracle High Availability Services on the local node is not running Messaging error [gipcretPermissions] [9]]
2014-11-18 18:55:52.781979 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: [network]  failed connect attempt endp 0x139a810 [000000000000003e] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-00000000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_)(GIPCID=00000000-00000000-0))', numPend 0, numReady 1, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x136c6e0, sendp 0x136c4a0 status 13flags 0xa02c0712, flags-2 0x1, usrFlags 0x14000 }, req 0x13684c0 [0000000000000048] { gipcConnectRequest : addr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_)(GIPCID=00000000-00000000-0))', parentEndp 0x139a810, ret gipcretPermissions (9), objFlags 0x0, reqFlags 0x2 }
2014-11-18 18:55:52.781998 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos op  :  sgipcnDSConnectHelper
2014-11-18 18:55:52.782001 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos dep :  Permission denied (13)
2014-11-18 18:55:52.782003 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos loc :  connect
2014-11-18 18:55:52.782009 : GIPCNET:2014938688:  gipcmodNetworkProcessConnect: slos info:  failed to /var/tmp/.oracle/sOCSSD_LL_node1_
2014-11-18 18:55:52.782215 : CSSCLNT:2014938688: clsssConnect: gipc request failed with 9 (0x31)
2014-11-18 18:55:52.782274 : CSSCLNT:2014938688: clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_)) failed, rc 9
2014-11-18 18:55:52.782442 :  CRSCTL:2014938688: GetParser::getClustermode: crsctl_cssctx failed with 6

The interesting bit is the “permission denied” thing. It most cases it points to a file system permission issue. As you can see from the code the affected directory is “/var/tmp/.oracle” which (at least on an Exacta) is a link to “/tmp”.

After looking carefully i saw that the directory permission for “/tmp” got changed:

drwxr-xr--   12 root root      4096 Oct  8 23:56 tmp

Which is not right! Luckily the fix is easy:

chmod 1777 /tmp

But what caused the issue? It turned out one of the things i´ve done was to install TSM. I´ve extracted the TAR archive in /tmp and installed it from there. I tried to reproduce the issue and succeeded indeed. By extracting the TAR file the directory permissions of the parent directory was changed:

BEFORE:

drwxrwxrwt   12 root root      4096 Nov 19 12:24 tmp

AFTER:

drwxr-xr--   12 root root      4096 Oct  8 23:56 tmp

Great work, IBM!

Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014)

While reading the following posts (second post) from Adam Leventhal

I have wondered what is exactly the status of parity-based redundancy in general and triple-parity in particular on Linux in MDADM and BTRFS.

TL;DR: There are patches to extend the linux kernel to support up to 6 parity disks but BTRFS does not want them because it does not fit their “business case” and MDADM would want them but somebody needs to develop patches for the MDADM component. The kernel raid implementation is ready and usable. If someone volunteers to do this kind of work i would support with equipment and myself as a test ressource.

BTRFS has some preliminary support for RAID5/6 but it is completely unusable in it´s current state. Some highlights:

  • As of kernel version 3.14, it works if everything goes right (that is: no errors like failing disks or corruptions), but the error handling is still lacking.
  • scrub cannot fix issues with raid5/6 yet. This means that if you have any checksum problem, your filesystem will be in a bad state.
  • btrfs does not yet seem to recognise that if you remove a drive from an array and then later plug it back in the drive is then out of date.
  • btrfs does not handle very well a drive that is present but not working : for example, an attempt to remove a faulty drive from the array (btrfs device delete) fails because it causes or requires reading from the faulty drive itself and thus btrfs will continue to attempt to access the faulty drive forever.
  • … plus some minor but not to critical things

So much for the parity-based status of BTRFS in general – it is evident it is still in its infancy and needs to mature as in its current state it is unusable. (As of December 2014 i´ve seen some patches appearing on the mailing list intended to fix some of these problems but they are not yet in the code.)

During my research I also found some patches from Andrea Mazzoleni that would extend the kernels ability to have up to six parity disks – per file system. Patches to add this functionality to BTRFS were also supplied (see here for the patches).
For use cases where six parity disks are not needed you can of course also pick three or four parity disks – it´s up to you to decide. This would enable BTRFS to create rather large local file systems with a very good resiliency against data loss.

I have tested the patches and they work just fine. So i wondered why they were not added as they would provide a unique feature to BTRFS no other operating system or file system (not even ZFS) has.

My question on the mailing list and the answer can be see here:

LINK #1, LINK #2 and the answer here. Tests with the patches installed can be found here.

Still not satisfied with the answer i mailed some people offline and got the following responses i´d like to show here::

Our plan is based on the upper distributed fs or storage, which can provide
higher reliability than local filesystems, so we think RAID1/RAID10/RAID5/RAID6
is enough for us.

Your work is very very good, it just doesn’t fit our business case.

And (from a different person):

It would be there some day in the long run I guess. But for now
bringing in the support for enterprise use cases and stability
to the btrfs has been our focus.

If course I do agree with the point that stability matters most given the ongoing problems and bugs with BTRFS that can easily cause a loss of a whole file system (e.g. the most recent problems with snapshots in 3.16 and 3.17 kernels, system lockups, poor performance, space problems and so on) but if a file system also includes the functionality of a volume manager then it needs to provide at least some basic features such as:

  • parity-based redundancy
  • hot-spare disks or n-way-mirroring (see below)
  • mirroring
  • striping
  • n-way-mirroring if there is not support for hot-spare disks (that means more than one copy of the data to protect against a failure of more than one disk in a mirrored configuration)

Currently the *only* working implementation which offers resiliency against disk failures that BTRFS provides is mirroring. Ultimately this means RAID1 for a file system with exactly two disks or RAID1+0 for a file system with more than two disks. A failure of more than one disk is fatal in every case. Hot-spares? Not available and nobody is working on it. N-Way-Mirroring? Not available and nobody claimed it. RAID5/6? You guess it… not available/usable at the moment. No mirroring at all? That´s not what i´d use BTRFS for as the whole idea of a checksum based file system is to detect and repair corruptions which is impossible in a scenario with no redundancy.

The claimed on ongoing feature development for BTRFS can be found here. ##link here to the wiki list## As you see – nobody is working on hot-spare, n-way-mirroring support or RAID5/RAID6.

So returning to the answer I got offline i´d like to ask what the overall goal of BTRFS is and for whom it is developed – the so called “(our) business case” in the cited text. To me it seems the focus is not to be a general purpose linux file system; instead it seems to be more a file system for large enterprises where the data redundancy is already made at lower (= storage) levels. Support for large local file systems like ZFS offers is not available and (according to my impression) not desired.

Now if you read through all of this and you want to comment like

“… then develop and post patches – they´d be highly welcomed”

then I tell you: They are obviously not. At least on BTRFS.

The patches posted were working fine and can easily be adapted for the upstream kernel. Given the premature status of the RAID5/6 support in BTRFS it´s the ideal time to add them – before a lot of work is put into the RAID5/6 code. So why hesitating?

Neil Brown who is supporting the kernel functions for the raid code would send the patches upstream immediately – if someone (that means either MDADM itself or BTRFS) would use them. MDADM patches are not ready (yet?) and BTRFS does´t want them it seems.

So yeah.. so much for BTRFS and parity-based redundancy.

For MDADM it looks a bit better. The effort needed to change the code in MDADM is a non-trivial one and currently Andrea Mazzoleni who developed the patches does not have any time to work on a MDADM patch. I cannot do it myself as I am a pretty bad software developer… If you think you can do it i´d be happy to help you as a tester or with equipment so that at least MDADM get´s a much better parity implementation. Please contact me if you are willing to do this.

So which options are left for large local file systems on Linux?

You could either use ZFS on Linux which support up to three parity disks per VDEV (a number of disks are grouped into a VDEV onto which the chosen parity is applied) and then scale by adding more VDEVs – or you could use several smaller RAID6 (i would not recommend RAID5 due to the high risk of losing the raid if a disk fails and you need to rebuild) and group them together with LVM on top. Leaving the licensing discussion aside ZFS on Linux works quite well but requires a lot of manual work to get it working and running. If it runs it is stable – just make sure you use the latest ZFS code. On top of that ZFS is by far more mature than BTRFS while offering all of it´s features and more.

Solaris Live Network Bandwidth Monitoring

When doing performance analysis the current network throughput is often interesting. The following nice script display that – even without any root permissions.

#!/bin/sh
 # usage: netvolmon DEV [INTERVAL]
 DEV=$1
 IVAL=${2:-5}
getrxtx() {
 kstat -p "*:*:$1:*bytes64" |
 awk '{print $2}'
 }
rxtx=`getrxtx $DEV`
 while sleep $IVAL; do
 nrxtx=`getrxtx $DEV`
 (echo $IVAL $rxtx $nrxtx) |
 awk 'BEGIN {
 msg = "%6.2f MB/s RX %6.2f MB/s TX\n"}
 {rxd = ($4 - $2) / (1024*1024*$1);
 txd = ($5 - $3) / (1024*1024*$1);
 printf msg, rxd, txd}'
 rxtx="$nrxtx"
 done

The script is taken from here: http://utcc.utoronto.ca/~cks/space/blog/solaris/SolarisNetworkBandwidth

It produces the following output which is sufficient to get a quick overview about the current network traffic:

bash-3.2$ /export/home/oracle/net.sh igb0 5
  0.59 MB/s RX  20.54 MB/s TX
  1.17 MB/s RX  40.81 MB/s TX
  1.71 MB/s RX  59.72 MB/s TX

ORA-29275: partial multibyte character during RMAN resync

Today i had a smaller problem registering a newly created database in a rman catalog. The only unusual thing was that the database was using a unicode character set.

RMAN failed with:

 
[oracle@rac02a ~]$ rman target / catalog rman/rman@rman
Recovery Manager: Release 11.2.0.3.0 - Production on Wed Feb 12 05:31:49 2014
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
connected to target database: ORA11 (DBID=781875003)
connected to recovery catalog database
RMAN> show all;
starting full resync of recovery catalog
 RMAN-00571: ===========================================================
 RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
 RMAN-00571: ===========================================================
 RMAN-03002: failure of show command at 02/12/2014 05:31:57
 RMAN-03014: implicit resync of recovery catalog failed
 RMAN-03009: failure of full resync command on default channel at 02/12/2014 05:31:57
 ORA-29275: partial multibyte character

As you can see the database was using Unicode:

 [oracle@rac02a ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.3.0 Production on Wed Feb 12 05:32:23 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to:
 Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
 With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
 Data Mining and Real Application Testing options
SQL> select * from nls_database_parameters;
NLS_CHARACTERSET       AL32UTF8

So i changed the environment to match the character set of the database:

[oracle@rac02a ~]$ export NLS_LANG=GERMAN_GERMANY.AL32UTF8

And… Problem solved!

[oracle@rac02a ~]$ rman target / catalog rman/rman@rman
Recovery Manager: Release 11.2.0.3.0 - Production on Mi Feb 12 05:32:53 2014
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
Mit Ziel-Datenbank verbunden: ORA11 (DBID=781875003)
Verbindung mit Datenbank des Recovery-Katalogs
 RMAN> show all;
Vollständige Neusynchronisation des Recovery-Katalogs wird begonnen
 Vollständige Neusynchronisation abgeschlossen
 RMAN-Konfigurationsparameter für Datenbank mit db_unique_name ORA11 sind:
 CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default
 CONFIGURE BACKUP OPTIMIZATION OFF; # default
 CONFIGURE DEFAULT DEVICE TYPE TO DISK; # default
 CONFIGURE CONTROLFILE AUTOBACKUP ON;
 CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '/nfs/backup/rman/ORA11/%F';
 CONFIGURE DEVICE TYPE DISK PARALLELISM 1 BACKUP TYPE TO COPY;
 CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
 CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
 CONFIGURE CHANNEL DEVICE TYPE DISK FORMAT   '/nfs/backup/rman/ORA11/%U';
 CONFIGURE MAXSETSIZE TO UNLIMITED; # default
 CONFIGURE ENCRYPTION FOR DATABASE OFF; # default
 CONFIGURE ENCRYPTION ALGORITHM 'AES128'; # default
 CONFIGURE COMPRESSION ALGORITHM 'BASIC' AS OF RELEASE 'DEFAULT' OPTIMIZE FOR LOAD TRUE ; # default
 CONFIGURE ARCHIVELOG DELETION POLICY TO NONE; # default
 CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/nfs/backup/rman/ORA11/backup/snapcf_ORA11.f';

Adding more than 512 LUNs to Oracle Enterprise Linux

When dealing with large databases it might be neccessary to have more than 512 LUNs attached to the server. Most Linux distributions default to 512 luns max. This gets even worse with multipathing where each path is represented as a single lun.

Error messages like this indicate the limit was hit:

Dec 02 22:24:46 server kernel: scsi: host 2 channel 0 id 2 lun258 has a LUN larger than allowed by the host adapter
 Dec 02 22:24:46 server kernel: scsi: host 2 channel 0 id 2 lun259 has a LUN larger than allowed by the host adapter

So what is needed to change that limit?

First you need to modify the maximum numbers of LUNs allowed by the SCSI stack by adding the following line to /etc/modprobe.conf:

options scsi_mod max_luns=2048

Depending on the SCSI HBA used the driver module itself might need some adjustments. For QLOgic HBAs the following line ist required in /etc/modprobe.conf:

options lpfc lpfc_max_luns=2048

After rebooting or removing and re-inserting the module you should be able to attach 2048 LUNs.

By the way the linux device interface (including the multipath driver) supprts more than 254 LUNs sind Kernel 2.6:

[root@ora1 ~]# vgs 
 VG     #PV #LV #SN Attr   VSize   VFree 
 vgroot   1 388   0 wz--n- 179,88G 94,50G

As you can see the VG has 388 LVs. More than 254 … so we actually should run out of minor numbers in /dev, right? Actually major- and minor numbers were increaed to 10 bits in kernel 2.6:

[root@ora1 mapper]# ls -la /dev/mapper/vgroot-disk00*
brw-rw---- 1 root disk 253,  6 24. Dez 05:27 /dev/mapper/vgroot-disk001
brw-rw---- 1 root disk 253,  7 24. Dez 05:27 /dev/mapper/vgroot-disk002
brw-rw---- 1 root disk 253,  8 24. Dez 05:27 /dev/mapper/vgroot-disk003
brw-rw---- 1 root disk 253, 308 24. Dez 05:27 /dev/mapper/vgroot-disk306
brw-rw---- 1 root disk 253, 309 24. Dez 05:27 /dev/mapper/vgroot-disk307
brw-rw---- 1 root disk 253, 310 24. Dez 05:27 /dev/mapper/vgroot-disk308
brw-rw---- 1 root disk 253, 311 24. Dez 05:27 /dev/mapper/vgroot-disk309